Preface


Numerical analysis has advanced greatly since it began as a way of creating 
methods to approximate answers to mathematical questions. This book aims to 
bring students closer to the frontier regarding the numerical methods that are used. 
But this book is not only about numerical methods, whether newer or “classical”.
Rather, the aim is also to explain how and why they work, or fail to work. This
means that there is a significant amount of theory to be understood. Simple analyses 
can result in methods that usually work, but can then fail in certain circumstances, 
sometimes catastrophically. The causes of success of a numerical algorithm and its 
failure are both important. Without understanding the underlying theory, the reasons
for a method’s success and failure remain mysterious, and we do not have a
means to determine how to fix the problem(s).

In this way, numerical analysis is a dialectic¹ between practice and theory; the
practice being computation and programming, and the theory being mathematical 
analysis based on our model of computation. While we do indeed prove theorems in 
numerical analysis, the assumptions made in these theorems may not hold in many 
situations. Also, the conclusions may involve statements of the form “as n goes to 
infinity” (or “as h goes to zero”) while in actual computations n might not be 
especially large (or h especially small). Numerical analysis will sometimes ignore 
errors that we know exist (like roundoff error in the analysis of a method for solving 
differential equations). This is usually based on an understanding that some sources 
of errors are insignificant in a particular situation. Of course, there will be situations 
where roundoff error should be considered in the solution of differential equations, 
but only if the step size becomes unusually small. In that case, new analyses, and 
even new methods, may be necessary. 


¹ Dialectic is a dialog between a claim (a thesis) and counter-claims (an antithesis), hopefully
leading to a new understanding (a synthesis) that incorporates both the original claim and the 
counter-claims. The synthesis is expected to give further understanding, but will itself eventually 
meet counter-claims.

Inspiration for this book has been found in the books of Atkinson and Han [11],
Atkinson [13], Sauer [228], and Stoer and Bulirsch [241]. However, we wish to 
include current issues and interests that were not addressed in these books. 

We aim to present numerical methods and their analysis in the context of modern 
applications and models. For example, the standard asymptotic error analysis of 
differential equations gives no advantage to implicit methods, which have a much 
larger computational cost. But for “stiff” problems there is a clear, and often
decisive, advantage to implicit methods. While “stiffness” can be hard to quantify, it 
is also common in applications. We also wish to emphasize multivariate problems 
alongside single-variable problems: multivariate problems are crucial for partial 
differential equations, optimization, and integration over high-dimensional spaces. 
We deal with issues regarding randomness, including pseudo-random number 
generators, stochastic differential equations, and randomized algorithms. Stochastic 
differential equations meet a need for incorporating randomness into differential 
equations. High-dimensional integration is needed for studying questions and 
models in data science and simulation. 

To summarize, I believe numerical analysis must be understood and taught in the 
context of applications, not simply as a discipline devoted solely to its own internal 
issues. Rather, these internal issues arise from understanding the common ground 
between analysis and applications. This is where the future of the discipline lies. 

I would like to thank the many people who have been supportive of this effort, or 
contributed to it in some way. I would like to thank (in alphabetical order) Jeongho 
Ahn, Kendall Atkinson, Bruce Ayati, Ibrahim Emirahmetoglu, Koung-Hee Leem, 
Paul Muhly, Ricardo Rosado-Ortiz, and Xueyu Zhu. My wife, Suely Oliveira, deserves
special thanks for both encouraging this project and having the patience for me to
see it through. Finally, I would like to thank the staff at Springer for their interest 
and support for this book, most especially Donna Chernyk. 


How to Use This Book: 


Numerical analysis is a combination of theory and practice. The theory is a mixture 
of calculus and analysis with some algorithm analysis thrown in. Practice is 
computation and programming. The algorithms in the book are shown as 
pseudo-code. Working code for MATLAB and/or Julia can be found at
https://github.com/destewart2022/NumerAnal-Gradbook. The exercises are intended to
develop both theory and practice, and students need practice at both.

Like most intellectual disciplines, numerical analysis is more a spiral than a 
straight line. There is no linear ordering of the topics that makes complete sense. 
Thus teaching from this book should not be a matter of starting at one cover and 
ending at the other. In any case, there is probably too much material and an 
instructor must of necessity choose what they wish to emphasize. Matrix
computations are foundational, but even just focusing on this could easily take a
semester or more. Differential equations arise in many, many applications, but the 
technical issues in partial differential equations can be daunting. The treatment here 
aims to be accessible without “dumbing down” the material. Students in data 
science may want to focus on optimization, high-dimensional approximation, and 
high-dimensional integration. Randomness finds its way into many applications, 
whether we wish it or not. So, here is a plan that you might consider when you first 
teach from this book: 


• Chapter 1: A little on computing machinery to get started, floating point
arithmetic, norms for vectors and matrices, and at least one-variable Taylor
series with remainder.

• Chapter 2: LU factorization for linear systems, linear least squares via the
normal equations and the QR factorization as a black box. Eigenvalues can wait
until later.

• Chapter 3: Bisection, fixed-point, Newton, and secant methods are the foundation;
guarded multivariate Newton and one-variable hybrid methods give
good examples of how to modify algorithms for better reliability.

• Chapter 4: Polynomial interpolation is so central that you need to cover this
well, including the error formula and the Runge phenomenon; the Weierstrass
and Jackson theorems (without proof) give a sense of rate of convergence. Cubic
splines give useful alternatives to plain polynomials. Lebesgue numbers may
appear abstract, but give a good sense of the reliability of interpolation schemes.
Radial basis functions give an entry into high-dimensional approximation.

• Chapter 5: Simple ideas can go a long way, but “integrate the interpolant” is a
central idea; multivariate integration is also valuable here if you want to use it
for partial differential equations.

• Chapter 6: Basic methods for solving ordinary differential equations are still
very useful, although the revolution brought about by John Butcher’s approach
to Runge-Kutta methods is worth a look if you have time. Partial differential
equations need some more set-up time, but are worthwhile for more advanced
students, or a second time around. The scale of the problems for partial differential
equations means that you should point your students back to Chapter 2
on how to solve large linear systems.

• Chapter 7: Randomness is important, and some statistical computation using
SVDs may be a doorway to these issues. Random algorithms are also very
important, but often involve advanced ideas.

• Chapter 8: Optimization has become more important in a number of areas, and
including it is an option to consider. If your students want to do machine
learning, some outline of the algorithms is available here.


A second course could be focused on specific issues. A machine learning focus 
could include iterative methods and SVDs for matrix computations in Chapter 2, 
radial basis functions from Chapter 4, high-dimensional integration from Chapter 5, 
methods for large-scale optimization from Chapter 8, rounded out with some
analysis of random algorithms from Chapter 7. A simulation-based course would
focus on approximation and interpolation in two or three dimensions from Chapter
4, multi-dimensional integration from Chapter 5, and much of the material from Chapter
6 on differential equations, both ordinary and partial. Uncertainty quantification
could be served by starting with Chapter 7, progressing to iterative methods for 
large linear systems in Chapter 2, radial basis functions in Chapter 4, perhaps some 
partial differential equations in Chapter 6 and optimization in Chapter 8. 


Notes on Notation: 


Scalars are usually denoted by lower case italic letters (such as $x$, $y$, $z$, $\alpha$, $\beta$) while
vectors are usually denoted by lower case bold letters (such as $\mathbf{x}$, $\mathbf{y}$, $\mathbf{z}$, $\boldsymbol{\alpha}$, $\boldsymbol{\beta}$).
Matrices are usually denoted by upper case italic letters ($X$, $Y$, $A$, $B$). Entries of a
vector $\mathbf{x}$ are scalars, and so are denoted $x_i$; entries of a matrix $A$ are denoted $a_{ij}$.
Matrices and vectors usually have indexes that are integers $i = 1, 2, \ldots, m$,
$j = 1, 2, \ldots, n$ for an $m \times n$ matrix. Sometimes it is convenient to have index sets that
are different, so a matrix $A = [a_{ij} \mid i \in R,\ j \in C]$ can have row index set $R$ and
column index set $C$, and $\mathbf{x} = [x_i \mid i \in C]$ has index set $C$. This means
$A\mathbf{x} = [\sum_{j \in C} a_{ij} x_j \mid i \in R]$ is the matrix–vector product. Just as $A^T$ is used to denote the
transpose of $A$, $A^{-T}$ is the inverse of $A^T$, which is also the transpose of $A^{-1}$.

Functions are usually introduced by naming them. For example, the squaring
function can be introduced as $q(x) = x^2$. Functions of several variables can be
introduced as $f(x, y) = x^2 + y^2$ or using vectors as $f(\mathbf{x}) = \mathbf{x}^T \mathbf{x}$. Anonymous
functions can be introduced as $\mathbf{x} \mapsto \mathbf{x}^T \mathbf{x}$.
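
In a programming language, these two ways of introducing functions correspond to named and anonymous functions. The following is a minimal Python sketch, assuming vectors are stored as plain lists of numbers; the names `q`, `f`, `g`, and `v` are chosen here for illustration only.

```python
def q(x):
    """Named function q(x) = x^2."""
    return x * x

def f(x):
    """Named function of a vector argument, f(x) = x^T x."""
    return sum(xi * xi for xi in x)

# Anonymous function x |-> x^T x; it is bound to a name only so it can be called below.
g = lambda x: sum(xi * xi for xi in x)

v = [1.0, 2.0, 3.0]
print(q(3.0), f(v), g(v))
```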

Pseudo-code uses “←” to assign a value to a variable (for example, x ← y assigns the
value of y to x) while “=” tests for equality (where x = y returns true if x and y have
the same value).

On some occasions, “:=” is used to define a quantity or function where using “=”
might be ambiguous. The sets $\mathbb{R}$, $\mathbb{C}$, and $\mathbb{Z}$ are understood to be the set of real
numbers, the set of complex numbers, and the set of integers, respectively.


Iowa, USA David E. Stewart 
Chapter 1
Basics of Numerical Computation


There was a time when computers were people [110]. But since the late 1950s,
with the arrival and development of electronic digital computers, the people who did
the calculations have been replaced by machines. The history of computing in
NASA revealed in Hidden Figures [233] gives an example of this. The machines 
have advanced enormously since that time. Gordon Moore, one of the founders of 
Intel, came up with his famous “law” in 1965 that the number of transistors on a 
single “chip” of silicon doubled every year [179]. Since then the time for doubling 
has stretched out from 1 year to over 2 years, but the exponential growth of the 
capabilities of computer hardware has not yet stopped. 

Numerical computations have been an essential part of the work done by computers
since the earliest days of electronic computers. An understanding of how they
work is essential if you want to get the best performance out of your computer. And 
the scale of numerical computations can be truly enormous. In this chapter, some 
aspects of how computers function will be explained, along with how that affects 
numerical methods and numerical software. 


1.1 How Computers Work 


The heart of any computer is the Central Processing Unit (CPU). Since the 1980s, CPUs
have usually been built on a single piece of silicon, that is, on a single “chip”. Other essential parts of a
computer are its memory, which almost always includes fast-access writable memory 
(usually referred to as RAM for Random Access Memory), and permanent storage 
(such as a hard disk drive or solid-state drive), and means of communicating with 
the outside world. Communicating with the outside world can be through keyboard, 
mouse, Wi-Fi, and display for your laptop, a direct Internet connection, or a data 
link to a large-scale storage system. Perhaps you have sensors, such as a camera or
a microphone, as well as output devices, such as a headphone set. However
a computer communicates with the outside world, it is essential that it does so.

But here we focus on the “beating heart” of any computer: the CPU. This is where 
the computations are done. How a CPU is organized and performs its computations 
is the subject of computer architecture. There are many books on this subject, such 
as [24, 192]. 


1.1.1 The Central Processing Unit 


Modern computers are sometimes referred to as electronic digital stored program 
computers. Electronic refers to the physical nature of the data in the computer. Digital 
means that only discrete values of these physical values are important, rather than an 
analog computer which uses continuous values of voltage, for example, to represent 
the computed quantities. Stored Program means that the task (the program) is stored 
in memory, rather than being physically fixed. 

The fact of a computer being a stored program computer means that the CPU has 
to retrieve the operations to perform as well as the data to apply the operations to. 
This also means that the CPU has to have a control unit for decoding the instructions 
in the program, as well as an arithmetic and logic unit (ALU) which does the actual 
computations. In addition, the CPU has its own fast-access memory. Every CPU has,
at a minimum, a number of registers, which are small, fixed units of memory that are
easily accessed for computations.


1.1.2 Code and Data


The instructions are in machine code, which is the basic “language” that the CPU 
understands and executes. The CPU reads an instruction from memory, and then sets 
about executing the instruction by sending signals to the arithmetic and logic unit, 
the memory system, or some communication system. Executing an instruction will 
typically involve reading some data item from a register, sending it to the appropriate 
input of the arithmetic and logic unit, performing the appropriate operations, and 
sending the result back to a certain register to await the next operation, or being 
written to memory. Some operations simply read a specified memory location and
store the result in a specified register. Other operations do the reverse: they save the
contents of a specified register in a specified memory location.
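
To make this concrete, here is a minimal Python sketch of a toy register machine. It is not modeled on any real instruction set: the opcode names `LOAD`, `STORE`, `ADD`, and `MUL`, the four registers, and the tuple encoding of instructions are all assumptions made for illustration.

```python
def run(program, memory):
    """Execute a list of (opcode, operands...) tuples on a toy register machine."""
    registers = [0.0] * 4          # small, fast storage inside the "CPU"
    pc = 0                         # program counter: index of the next instruction
    while pc < len(program):
        op, *args = program[pc]    # read and decode the instruction
        pc += 1
        if op == "LOAD":           # memory location -> register
            reg, addr = args
            registers[reg] = memory[addr]
        elif op == "STORE":        # register -> memory location
            reg, addr = args
            memory[addr] = registers[reg]
        elif op == "ADD":          # arithmetic on registers
            dst, src1, src2 = args
            registers[dst] = registers[src1] + registers[src2]
        elif op == "MUL":
            dst, src1, src2 = args
            registers[dst] = registers[src1] * registers[src2]
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return memory

# Compute memory[3] = (memory[0] + memory[1]) * memory[2].
memory = [2.0, 3.0, 4.0, 0.0]
program = [
    ("LOAD", 0, 0),
    ("LOAD", 1, 1),
    ("ADD", 2, 0, 1),
    ("LOAD", 3, 2),
    ("MUL", 2, 2, 3),
    ("STORE", 2, 3),
]
run(program, memory)
print(memory[3])
```

Even this small example shows the pattern described above: data is loaded into registers, the arithmetic is done on registers, and the result is written back to memory.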

The machine code used by a CPU is an important part of its design. Intel has been 
producing CPUs for longer than any other current producer (as of time of writing). 
Intel’s commercial strategy has dictated that new CPUs in their line have essentially 
the same machine code as older CPUs, bolstered by a few new instructions to provide 
some additional features. This is an example of backwards compatibility, where old 
computer code can still be executed without change. As a result, while Intel CPUs
have many thousands of times as many transistors as their CPUs from the 1980s, they
run essentially the same machine code. On the other hand, there was a revolution in
the 1990s in CPU design which emphasized the simplicity of their architecture. These
CPUs were called RISC for reduced instruction set computers. The current processors 
in this family are the ARM (for advanced RISC machine) CPUs [84]. ARM CPUs 
are currently found in cell phones and tablets and other devices such as Raspberry 
Pi systems. Their chief advantage for these applications is low power consumption. 
Other instruction sets are available, such as the public domain RISC V (pronounced
“risk-5”) instruction set [199]. There are a number of hardware implementations of
the RISC V instruction set, although few are ready-to-use computers. 


1.1.2.1 Read–Execute Cycle


The read–execute cycle is the cycle of reading an instruction, decoding the instruction(s),
and executing the instructions. This cycle is controlled by a clock. Ideally, a
basic instruction can be executed in one clock cycle. However, the chase after ever 
shorter clock cycles has resulted in instructions executing over multiple clock cycles 
while the CPU can still complete close to one instruction each clock cycle. To achieve 
this, the CPU uses a pipelined architecture: computations are spread over multiple 
units. Each unit carries out a basic step in one clock cycle, then passes its results to 
the next unit in the pipeline. 
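
The benefit of pipelining can be estimated with a little arithmetic. The sketch below assumes an idealized pipeline with no stalls or dependencies between instructions, and a hypothetical depth of five stages; real CPUs differ in both respects. The function names are illustrative.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction passes through all stages before the next one starts."""
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages):
    """Ideal pipeline: n_stages cycles for the first instruction, then one more per instruction."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

for n in (1, 10, 1000):
    p = pipelined_cycles(n, 5)
    print(n, unpipelined_cycles(n, 5), p, p / n)
```

Under these assumptions, the number of cycles per instruction approaches one as the number of instructions grows, even though each individual instruction still takes five cycles from start to finish.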

The vast numbers of transistors in modern CPUs mean that there are opportunities
to exploit parallelism across the different components of the CPU. This might take 
a basic form, such as where an integer processing unit is used to control a for 
loop, while a floating point processing unit is simultaneously used to carry out the 
main computations in the body of the loop. Multiple computational units can also 
be harnessed in parallel, forming a vector architecture. These possibilities provide 
opportunities for greater performance, but require more complex control units in the 
CPU. 


1.1.2.2 Memory Hierarchies


One of the bottlenecks in modern CPUs is reading and writing data from or to memory. 
More transistors on a chip means smaller transistors that are closer together. This 
means that switching times are shorter, and signal transmission times within the CPU 
are shorter. However, the time needed to send data to RAM or read data from RAM 
has not become much shorter. As a result, the time needed just to store or read data 
has become much larger than the time for a floating point operation. 

Fig. 1.1.1 Memory hierarchy

To deal with this, computer architecture has developed memory hierarchies, similar
to that shown in Figure 1.1.1. At the top of the hierarchy is the small, fast memory
of registers. Typically, the contents of a register are available in a single clock cycle.
The next level(s) consists of cache memory. Data in level 1 cache can be read
into a register, or written from a register, in one to a few clock cycles depending on the specific
architecture. Rather than require the programmer to remember if a piece of data is
currently in cache memory, the hardware keeps track of this. If a requested data item 
is not in level 1 cache, the system then looks in level 2 cache. If the requested item is 
in level 2 cache it will be transferred to level 1 cache, and can then be transferred to 
a register for computations on it. Cache memory is organized into cache lines. Each 
cache line is a single block of consecutive memory locations. The size of this block 
can depend on the level of the cache. Level 1 cache typically (at time of writing) has 
a cache line size of 64 bytes. Some architectures have a level 2 cache line size of 128 
bytes. 

If a data item is not in level 2 cache memory, then the system goes to level 3 cache 
(if present), or to main memory (RAM). Since RAM is not on the chip itself, the 
request goes to another piece of silicon on the same printed circuit board. Instead of 
taking dozens of clock cycles to obtain a data item from level 2 cache, obtaining a 
data item from RAM typically takes hundreds of clock cycles. Data transferred from 
RAM is typically transferred in much larger blocks than from cache memory. 

Almost all modern computers implement virtual memory, which means that not 
all data in use by the computer is stored in RAM. Some memory blocks not in RAM 
can be stored in more permanent memory such as the hard disk drive or solid-state 
drive of the computer. In this way, the computer can work with larger data sets directly 
without having explicit instructions to save data in this way. The time to access a 
data item on a hard disk is typically measured in milliseconds and thus millions of 
clock cycles in current computers. 

Beyond the permanent storage of a computer is the memory that is accessible 
over networks, such as the Internet. This gives the computer access to much larger 
memory systems, but the time to access them is again much larger—measured in 
seconds or billions of clock cycles, rather than milliseconds for permanent storage. 

This creates a hierarchy of memory systems called a memory hierarchy. At each
level of this hierarchy, there has to be a way of remembering where in main memory,
or the virtual address space, each data item belongs. This is done using a tag for
each block of memory and a translation look-aside buffer (TLB) containing the tags
and other data for each block of memory at a given level. There also needs to be a bit
for each block of memory indicating whether it needs to be written back to the next
lower level of memory or not. If a block needs to be written back to the next level,
it is called dirty.

Since each level of the memory hierarchy contains a fixed amount of memory, 
when a new block is read in, there must be a way of deciding which block of memory 
must be removed. If that block is “dirty” then it must be written back to the next 
lower level of memory first. Various algorithms are used to decide which block of 
memory is to be removed. The most common of these is the least recently used 
(LRU) heuristic. Implementing this correctly in a deep memory hierarchy requires 
co-ordination between the different levels. 
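
The following Python sketch models a single level of the hierarchy with LRU replacement and a dirty bit per block. It is a toy model under simplifying assumptions: blocks are identified by arbitrary tags, there is only one level, and the class name `LRUCache` and its attributes are hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Toy model of one cache level holding at most `capacity` blocks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # tag -> dirty flag, least recently used first
        self.misses = 0
        self.writebacks = 0

    def access(self, tag, write=False):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)          # mark as most recently used
        else:
            self.misses += 1
            if len(self.blocks) >= self.capacity:
                _, dirty = self.blocks.popitem(last=False)   # evict the LRU block
                if dirty:
                    self.writebacks += 1          # dirty block is written to the lower level
            self.blocks[tag] = False              # newly loaded block is clean
        if write:
            self.blocks[tag] = True               # block now differs from the lower level

cache = LRUCache(capacity=2)
cache.access("A", write=True)   # miss; A becomes dirty
cache.access("B")               # miss
cache.access("A")               # hit; A is now the most recently used block
cache.access("C")               # miss; evicts B, which is clean
cache.access("D")               # miss; evicts A, which is dirty
print(cache.misses, cache.writebacks)
```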

As we go down the memory hierarchy, the amount of memory available increases 
enormously, while the access times also increase enormously. The rate of data transfer 
decreases significantly as we go down the memory hierarchy as well. This means that 
to obtain best performance, data should be re-used at the top levels of the memory 
hierarchy as much as possible. 


1.1.2.3 Thrashing


Thrashing is what happens when you are trying to fit too much into a level of the 
memory hierarchy. This can happen at any level of the memory hierarchy and was 
first noticed in dealing with virtual memory. Virtual memory makes it appear to a 
programmer that there is more memory available to a program than can fit in main 
memory by using permanent storage, like a hard disk. If accessing data that is outside 
of main memory is done sporadically, then there is little cost to this. But if the program 
is regularly scanning the entire data set, then data transfers from permanent storage 
to main memory are done with every step of the program. The program slows down 
to the speed with which data can be transferred from permanent storage. This can 
slow down programs by a factor of 1,000 or more. 

Thrashing can also happen between cache memory and main memory. If the 
program is trying to scan a data set that does not fit in cache memory, then the 
computer will find itself loading data from main memory into cache memory with 
every step of the program. The program will slow down to the speed with which data 
can be transferred from main memory, slowing the program by a factor of 10 to 100. 

The way to reduce thrashing is to re-use data when loaded as much as possible, 
and to use data that is nearby in memory as much as possible. 
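
The effect of using nearby data can be seen by counting cache misses for two ways of scanning a matrix stored row by row. This is a sketch with a simulated LRU cache; the cache line length of 8 items, the capacity of 4 lines, and the 32 × 32 matrix are deliberately tiny, hypothetical values chosen to make the effect visible.

```python
from collections import OrderedDict

def count_misses(addresses, line_size=8, n_lines=4):
    """Count misses in a toy LRU cache of `n_lines` lines, each holding `line_size` items."""
    cache = OrderedDict()
    misses = 0
    for addr in addresses:
        line = addr // line_size            # which cache line holds this item
        if line in cache:
            cache.move_to_end(line)         # hit: mark line as most recently used
        else:
            misses += 1
            if len(cache) >= n_lines:
                cache.popitem(last=False)   # evict the least recently used line
            cache[line] = True
    return misses

n = 32  # n-by-n matrix stored row by row: entry (i, j) is at offset i*n + j
by_rows = [i * n + j for i in range(n) for j in range(n)]
by_cols = [i * n + j for j in range(n) for i in range(n)]

print(count_misses(by_rows), count_misses(by_cols))
```

In this model, scanning by rows uses every item of a cache line before moving on, while scanning by columns touches a different line at each step and evicts each line before it can be re-used.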


1.1.2.4 Code Versus Data Caches 


Many modern CPU architectures include separate level 1 caches for data and code. 
Because of the different memory access patterns for code and for data, this is often 
a good idea. Code usually should not change during execution, so there is often no
need for a “dirty” bit in a cache for code. Modern CPU architectures also often have a
specialized kind of cache for code called a micro-operation cache or µop cache. The
idea is that the front end of the control unit decodes machine code instructions into
a stream of micro-operations (or µops) that can be more directly executed by the
hardware. The back end of the control unit reads this stream of micro-operations
and can parallelize or re-order these micro-operations to better use the hardware.

Short loops can often be converted into a short sequence of micro-operations that
can fit into the µop cache. This avoids the need for further memory access to get
instructions for the duration of the loop.


1.1.2.5 Multi-core CPUs


As the number of transistors in a CPU has grown, there are more opportunities for 
parallel computation. Multiple functional units in the arithmetic-logic unit were an 
early sign of this, but in the past decade, it has become more explicit with CPUs 
now having multiple cores. A core is essentially a mini-CPU with its own control 
unit, arithmetic-logic unit(s), registers, and cache. However, multiple cores share the 
same “chip” as well as higher level cache, connections to main memory, and to the 
outside world. 

When a CPU has multiple functional units, the programmer does not have to be 
aware of them in order to use them: the CPU hardware identifies opportunities for 
this form of parallelism and uses them. Multiple cores, however, require some kind 
of explicit parallel computing. Programming with threads is the most common way 
of exploiting multi-core CPUs, although this is not the only way to do so. There are 
other reasons for programming with threads—for example, having one thread in a 
program doing computations, while another thread handles user interactions. 
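
As an illustration of the structure of a threaded computation, the sketch below splits a sum across several threads using Python's standard `concurrent.futures` module. The function names and the choice of four threads are illustrative. Note that in standard CPython the global interpreter lock typically prevents threads from executing Python code on several cores at once, so this shows how the work is divided rather than demonstrating a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Work done by one thread: sum its own slice of the data."""
    return sum(chunk)

def threaded_sum(data, n_threads=4):
    """Split `data` into roughly equal chunks and sum the chunks in separate threads."""
    size = max(1, (len(data) + n_threads - 1) // n_threads)
    chunks = [data[k:k + size] for k in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Each chunk is handled by a worker thread; the partial results are combined here.
        return sum(pool.map(partial_sum, chunks))

data = list(range(1, 100001))
print(threaded_sum(data))
```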

More information about parallel computing can be found in Section 1.1.6. 


1.1.3 On Being Correct 


As a practical matter, it is important for your algorithms and implementations to be 
correct. Ideally, we would prove that our algorithms are correct before we implement 
them. And once they are implemented, prove that our implementation is correct. 
Indeed, proof of correctness techniques are valuable tools in the arsenal of software 
engineers in their fight against bugs. However, it is infuriatingly common to look at
some code we have just written and not see a flaw that another person would see
in a moment, or that a simple test would reveal.

It is therefore imperative to test your code. Build tests alongside your code, or 
even before you code. Test as you code. Try to avoid writing a large piece of software
and only then testing it, because the largest part of the art of debugging is finding the error.
Once you find the error, it is usually easy to understand why it is wrong. Most times 
the fix is also quite clear. Sometimes the fix is not clear, in which case there may be 
a deeper misunderstanding about your algorithm that needs to be resolved first. But 
if you have a large piece of software, you have a much larger area to search if the 
results are incorrect.

If you are modifying an already tested and accepted piece of code, keep the old
code for comparison. Revision control systems (such as Git, and older systems such 
as the Unix RCS system) are useful for doing this. As instances of bugs arise, add 
an extra test to your list of tests. 
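
A small example of building tests alongside code might look like the following Python sketch. The function `mean` and its tests are hypothetical; the third test stands for a test added after a bug was found, in this case an imagined report about empty input.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("mean of an empty sequence is undefined")
    return sum(values) / len(values)

def test_mean_simple():
    assert mean([1.0, 2.0, 3.0]) == 2.0

def test_mean_single_item():
    assert mean([5.0]) == 5.0

def test_mean_empty_raises():
    # Regression test added after a (hypothetical) bug report about empty input.
    try:
        mean([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")

# Run the whole list of tests every time the code changes.
for test in (test_mean_simple, test_mean_single_item, test_mean_empty_raises):
    test()
print("all tests passed")
```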


1.1.4 On Being Efficient 


While computing hardware has become more efficient, the software (the algorithms
and data structures) has also become more efficient. Of course, for many computing
tasks, such as sorting a list of data items, there are some hard limits on how efficient 
the algorithm can be. 

But a numerical analyst should understand the efficiency of algorithms and how 
that relates to computing practice. 


1.1.4.1 Big “Oh” Notation 


The computational resources needed by an algorithm are often described using “big-Oh”
asymptotic notation. The time needed (or time complexity) to process an input
with $n$ data items might be a function $T(n)$, but it is often very difficult to obtain
a precise formula for $T(n)$, which in any case can vary with the CPU, the clock
speed, the operating system, the programming language, the particular compiler or
interpreter used, or even various options used by the compiler or interpreter. So it
is often better to simply give an asymptotic estimate for the time taken of the form
“$T(n) = O(g(n))$ as $n \to \infty$”, which means


there are constants $C, n_0 > 0$ where $n \ge n_0$ implies $T(n) \le C\, g(n)$.


The constants $C$ and $n_0$ are called “hidden constants” as they are not mentioned in the
“O” statement “$T(n) = O(g(n))$”. These hidden constants are often dependent on
the hardware and software used as well as the way the method is implemented. Interpreted
languages (like Basic, Python, or Ruby) are typically slower than compiled
languages (like C/C++, Fortran, or Go), which changes the value of the constant $C$.
Doubling the clock speed halves the value of $C$. Compiler or interpreter optimization
as well as the details of how an algorithm is implemented can strongly affect
the value of $C$. The cache size can affect the value of $n_0$ as well as $C$.

Because of all these issues, it is better to tell the reader something that does not
depend on implementation details, like “mergesort takes $O(n \log n)$ time to sort $n$
items”. This is true whether mergesort is implemented in Basic on a 1990-vintage
Mac, or a C++ version on the latest supercomputer, even if the times taken by the 
two implementations are very different. We know roughly how the time taken will 
change as the value of n becomes large. 
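
One way to see an implementation-independent measure in action is to count comparisons instead of measuring time. The Python sketch below is a straightforward top-down mergesort, written for illustration rather than taken from a library, that counts the comparisons it makes and reports the ratio to $n \log_2 n$; the scrambled input is an arbitrary choice.

```python
import math

def merge_sort(items, counter):
    """Return a sorted copy of `items`; counter[0] accumulates the comparisons made."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], counter)
    right = merge_sort(items[mid:], counter)
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        counter[0] += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])      # at most one of these two is non-empty
    merged.extend(right[j:])
    return merged

for n in (1000, 2000, 4000, 8000):
    data = [(7919 * k) % n for k in range(n)]   # a fixed scrambled input
    counter = [0]
    result = merge_sort(data, counter)
    assert result == sorted(data)
    print(n, counter[0], counter[0] / (n * math.log2(n)))
```

The ratio in the last column should stay bounded as $n$ doubles, which is what the statement “mergesort takes $O(n \log n)$ time” predicts for the number of comparisons.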

Table 1.1.1 Asymptotic resource usage for some sorting algorithms for n items

Algorithm        Time             Memory   Extra memory
bubblesort       $O(n^2)$         $O(n)$   $O(1)$
insertion sort   $O(n^2)$         $O(n)$   $O(1)$
shellsort (*)    $O(n\sqrt{n})$   $O(n)$   $O(1)$
quicksort        $O(n \log n)$    $O(n)$   $O(\log n)$
mergesort        $O(n \log n)$    $O(n)$   $O(n)$

(*) with gap sequence $2\lfloor 2^{-k-1} n \rfloor + 1$


Variations of this notation look at what happens as a parameter becomes small.
For example, the step size $h$ used for approximately solving differential equations
should be small. The statement “$f(h) = O(g(h))$ as $h \downarrow 0$” means that there are $C$ and
$h_0 > 0$ where

$|f(h)| \le C\, g(h)$ provided $0 < h < h_0$.


Lower bounds are designated using “big-$\Omega$” notation: “$f(n) = \Omega(g(n))$ as $n \to \infty$”
means there are $C > 0$ and $n_0$ where


$f(n) \ge C\, g(n)$ provided $n \ge n_0$.


Finally “big-$\Theta$” notation combines these: “$f(n) = \Theta(g(n))$ as $n \to \infty$” means that
both $f(n) = O(g(n))$ and $g(n) = O(f(n))$ as $n \to \infty$. Analogous definitions hold
as the input parameter becomes small. For example,


$e^x = 1 + x + \frac{1}{2}x^2 + O(x^3)$ as $x \to 0$


from the Taylor series for the exponential function.
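
This kind of statement can be checked numerically. The short Python sketch below computes the remainder $e^x - (1 + x + \frac{1}{2}x^2)$ divided by $x^3$ for a few decreasing values of $x$; the function name `remainder_ratio` and the sample values are illustrative. If the remainder is $O(x^3)$, the ratio should stay bounded, and from the next Taylor term we expect it to approach $1/6$.

```python
import math

def remainder_ratio(x):
    """(e^x - (1 + x + x^2/2)) / x^3, which should stay bounded as x -> 0."""
    return (math.exp(x) - (1.0 + x + 0.5 * x * x)) / x**3

for x in (0.1, 0.05, 0.025, 0.0125):
    print(x, remainder_ratio(x))
```

For much smaller values of $x$ the subtraction in the numerator loses accuracy to roundoff, so the computed ratio eventually stops following the theory; the values above are chosen to avoid that regime.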


1.1.4.2 Using Efficiency Measures 


We use the “big-Oh” notation for measuring the performance of algorithms. We can 
use this for measuring and reporting the amount of time needed by an algorithm, and 
the amount of memory needed. Table 1.1.1 shows the “big-Oh” measures of the time 
and memory needed for some well-known sorting algorithms (see [59]). 

The time needed for quicksort and mergesort is asymptotically equivalent. Furthermore,
what the table does not show is that this is guaranteed for mergesort, but
for quicksort it holds only for the average over randomized choices made during the
algorithm. Why then is quicksort so much more popular than mergesort?¹ Because
mergesort requires substantial ($O(n)$) extra memory, which must be allocated, and
later deallocated. The quicksort algorithm does require $O(\log n)$ extra memory, but
this is in the quicksort recursion, and so does not require explicit allocation. This
makes quicksort a more practical and attractive algorithm. With quicksort, you can
also push the size of the input to near the maximum amount of memory available
without creating memory overflow.


¹ Yes, quicksort is more popular. For example, the standard C library includes a quicksort function,
but no mergesort function.


Algorithm 1 Recursive factorial function

    function factorial(n)
        if n = 0: return 1
        else if n > 0: return n · factorial(n − 1)
        else: return 0
        end if
    end function

Using efficient algorithms is the start to making great and efficient programs. 
Use standard implementations first, such as available in the standard library for your 
programming language. 
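
To make the table and this advice concrete, here is a small Python sketch that compares a hand-written insertion sort with the built-in `sorted` function. The function name `insertion_sort` and the test size of 2000 items are our own choices for illustration, and timings depend on the machine, so none are quoted.

```python
import random
import time

def insertion_sort(items):
    """Sort a copy of items by insertion sort: O(n^2) time in the worst case."""
    a = list(items)
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        # Shift larger entries one place right to open a slot for key.
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a

random.seed(0)
data = [random.random() for _ in range(2000)]

t0 = time.perf_counter()
slow = insertion_sort(data)
t1 = time.perf_counter()
fast = sorted(data)          # standard library implementation
t2 = time.perf_counter()

print("insertion sort:", t1 - t0, "s;  built-in sorted:", t2 - t1, "s")
```

Both calls return the same sorted list; only the time taken differs.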


1.1.4.3 Optimizing Code 


Premature optimization is the root of all evil. 
Donald Knuth 


For programmers with some experience, it is tempting to make each piece of code 
efficient. This can be a mistake. Each additional “optimization” can 


• make the code more confusing,
• introduce hidden dependencies (or worse, outright bugs), and
• require careful documentation.


It is better to write code at the level of generality appropriate for the algorithm being 
implemented. If you need the code to be faster in a particular situation, profile the 
code first to see where the code is actually spending its time. Focus attention on 
that part of the code. Keep your original code. Then you can check the results and 
performance of your new code against the old code to ensure that they do the same 
thing (or are, at least, consistent). 
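
A minimal sketch of this workflow in Python, assuming the standard `cProfile` and `pstats` modules are used as the profiler. The two functions and the helper `profile_call` are hypothetical examples of ours: the original code is profiled and kept, and the faster replacement is checked against it.

```python
import cProfile
import io
import pstats

def slow_sum_of_squares(n):
    """Original, straightforward version: kept as the reference."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def fast_sum_of_squares(n):
    """Candidate replacement using the closed form (n-1)n(2n-1)/6."""
    return (n - 1) * n * (2 * n - 1) // 6

def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()

reference, report = profile_call(slow_sum_of_squares, 100_000)
print(report)                                   # where the time is spent
assert fast_sum_of_squares(100_000) == reference  # new code agrees with old
```

For floating point code the final comparison would normally allow a small tolerance rather than demand exact equality.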


1.1.5 Recursive Algorithms and Induction 


A recursive function is a function that, either directly or indirectly, calls itself. A
standard example is the factorial function where $n! = n \cdot (n-1) \cdot (n-2) \cdots 3 \cdot 2 \cdot 1$
for $n = 1, 2, \ldots$ and $0! = 1$ in Algorithm 1.
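
For readers who want to run it, Algorithm 1 can be transcribed almost line for line into Python. This is a sketch for illustration only; in practice a library routine such as `math.factorial` would be preferred.

```python
def factorial(n):
    """Recursive factorial following Algorithm 1; returns 0 for negative n."""
    if n == 0:
        return 1
    elif n > 0:
        return n * factorial(n - 1)
    else:
        return 0

print([factorial(n) for n in range(6)])   # [1, 1, 2, 6, 24, 120]
```

Each call waits for the call below it to return, so the depth of recursion equals n; interpreters limit this depth, which is one practical reason to understand how recursion is implemented.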
Algorithm 2 Recursive summation

function sum_recursive(a, i, j)
    if i > j: return 0
    else if i = j: return a_i
    else
        m ← ⌊(i + j)/2⌋    // (i + j)/2 rounded down
        return sum_recursive(a, i, m) + sum_recursive(a, m + 1, j)
    end if
end function


1.1.5.1 Principles of Induction 


The principle of mathematical induction is complementary to recursive functions in
computing. Suppose that $P(x)$ is a condition on a non-negative integer $x$ that satisfies
two properties: $P(0)$ is true, and $P(k)$ true implies $P(k+1)$ true as well. The
principle of induction states that any such condition is true for all non-negative
integers: $P(x)$ is true for $x = 0, 1, 2, 3, \ldots$. The principle of complete induction
applies to any condition $P(x)$ for which $P(0)$ is true, and $P(j)$ true for all
non-negative integers $j$ between zero and $k$ (inclusive) implies $P(k+1)$ is also true.
Again, if $P(x)$ is a condition on $x$ having these properties, then $P(x)$ is true for
$x = 0, 1, 2, 3, \ldots$. The principle of complete induction can be proved by applying
the principle of induction to the condition $Q(x)$ which states that $P(y)$ is true for all
integers $0 \le y \le x$.

How do we know that the factorial function does indeed compute the factorial of
$n$? For $n = 0$ we get the correct value from the first if. If we compute the correct value
for $n = k$ with $k$ a non-negative integer, then for $n = k + 1$ the computed value is
$(k + 1) \cdot \mathrm{factorial}(k)$ which is, by the induction hypothesis, $(k + 1) \cdot (k!) = (k + 1)!$
as we wanted. Thus, by the principle of mathematical induction, it computes the
correct value for $n = 0, 1, 2, 3, \ldots$.

For another example, consider the recursive algorithm for computing a sum
$\sum_{k=i}^{j} a_k$ in Algorithm 2.

To prove that this correctly computes the sum, we let $n = j - i + 1$, which is
the number of terms of the sum when $n \ge 0$. Let $P(n)$ be the condition that the sum of $n$
terms is computed correctly by sum_recursive(). This is clearly true for $n \le 0$ as then
the first return is taken. If $n = 1$ then $P(1)$ is true as the sum is just $a_i$ as returned
by the second return. Now suppose that $P(k)$ is true for all $0 \le k \le n$ with $n \ge 1$. We
want to show that $P(n + 1)$ is also true. Since the sum has $n + 1 \ge 2$ terms, $i \le m < j$.
Now sum_recursive(a, i, m) is computed correctly as the number of terms in the sum is
$m - i + 1 \le n$; also sum_recursive(a, m + 1, j) is computed correctly as the number of
terms in the sum is $j - m \le n$. Then adding these two values gives
$\sum_{k=i}^{m} a_k + \sum_{k=m+1}^{j} a_k = \sum_{k=i}^{j} a_k$ as we wanted. Then
by the principle of complete induction, $P(n)$ is true for $n = 0, 1, 2, \ldots$ and so no
matter the number of terms in the sum, sum_recursive(a, i, j) does indeed correctly
compute $\sum_{k=i}^{j} a_k$.
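
Algorithm 2 can likewise be written directly in Python. This sketch assumes 0-based indexing (Python's convention), with both end points i and j included in the sum.

```python
def sum_recursive(a, i, j):
    """Sum a[i] + ... + a[j] (inclusive, 0-based) as in Algorithm 2."""
    if i > j:
        return 0                      # empty range
    elif i == j:
        return a[i]                   # single term
    else:
        m = (i + j) // 2              # (i + j)/2 rounded down
        return sum_recursive(a, i, m) + sum_recursive(a, m + 1, j)

a = [3, 1, 4, 1, 5, 9, 2, 6]
print(sum_recursive(a, 0, len(a) - 1))   # 31
```

Because each call splits the range in half, the depth of recursion grows like $\log_2 n$ rather than $n$.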
Fig. 1.1.2. Stack frame structure


1.1.5.2 Implementation of Recursion 


Early computer systems did not make allowance for recursion, but that had changed 
by the 1970s. Some early computer systems did use recursion, such as Lisp in 1958 
by John McCarthy, where recursion was not only permitted, it was central to the 
whole design. In 1960, the programming language Algol was developed which also 
permitted recursion. By the early 1980s, recursion was a basic part of all computer 
systems. 

The key to implementing recursion is to have a stack which is a data structure 
where data items can be pushed onto the top of the stack, or pulled (or popped) from 
the top of the stack. Stacks can be used to create a general mechanism for calling a 
function that makes no distinction between whether the call is to another function or 
is recursive. 
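
To illustrate how a stack can stand in for recursion, the following Python sketch computes the same sum as the recursive summation algorithm but keeps its pending work on an explicit stack (a Python list). The function name `sum_with_stack` is ours. A real stack frame also stores a return address and local variables; this simplified version stores only the arguments of each pending call.

```python
def sum_with_stack(a, i, j):
    """Sum a[i..j] (inclusive) using an explicit stack instead of recursion."""
    stack = [(i, j)]                 # each entry is a pending sub-range
    total = 0
    while stack:
        lo, hi = stack.pop()         # pop the most recently pushed range
        if lo > hi:
            continue                 # empty range contributes nothing
        if lo == hi:
            total += a[lo]           # single term: add it directly
        else:
            mid = (lo + hi) // 2
            stack.append((mid + 1, hi))   # pushed first, handled later
            stack.append((lo, mid))       # pushed last, handled next
    return total

print(sum_with_stack(list(range(1, 101)), 0, 99))   # 5050
```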

Figure 1.1.2 shows the structure of a stack frame. 

As an example consider the code 


function g(u, v)
    w = u * v
    return w + 7 * v
end function

x = g(3, 7)
// return address points near here


Just before the “+” operation on the line with the return, the top of the stack 
would look like Figure 1.1.3a; just before executing the return statement is shown in 
Figure 1.1.3b; after assignment of the return value to x is shown in Figure 1.1.3c. 


(a) Before “+” in return   (b) Just before return   (c) After assignment to x

top of stack
7*v: 49
w: 21
v: 7
u: 3                       top of stack
return address             return address
return value               return: 70
x location                 x location               top of stack
x: 0                       x: 0                     x: 70

Fig. 1.1.3. Stack frames


1.1.6 Working in Groups: Parallel Computing


Explicit parallel computing is beyond the scope of this book, but suitable references
can be found in [247]. Computing with large numbers of threads is also used for
programming Graphical Processing Units (GPUs) [87] to do large-scale numerical
computations [204, Chap. 11].

Parallel computing can happen at a number of different levels. Parallel computing 
can happen at the level of individual cores through SIMD instructions that apply 
the same operation to multiple data items simultaneously, or multiple computational 
units working simultaneously on different data items. It can happen through having 
multiple “cores”, each of which is a CPU in its own right, with each core working 
on a different computation. Parallel computing can happen with multiple “chips”, 
each with multiple cores, working on different computations. Or multiple parallel 
computers can coordinate computations over the Internet. 

Parallel computation at the lowest level, using hardware units simultaneously 
as controlled by the control unit of each core, does not require explicitly parallel 
instructions from the programmer. However, there are ways of writing code that 
makes it possible (or makes it difficult) for the compiler and the hardware to identify 
this parallelism. Consider Algorithm 3, which shows two ways of computing the
inner product $x^T y = \sum_{i=1}^{n} x_i y_i$. The second variant uses loop unrolling to make it
clear that the summations over $s_1$, $s_2$, $s_3$, and $s_4$ in lines 3-8 can be done independently
and in parallel.
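
The two variants of the inner product in Algorithm 3 can be sketched in Python as follows, using 0-based indexing. In an interpreted language the unrolled version is not expected to be faster; the sketch is meant only to show the structure of the four independent partial sums and the clean-up loop for the remaining entries.

```python
def innerprod(x, y):
    """Plain inner product: one running sum."""
    s = 0.0
    for xi, yi in zip(x, y):
        s += xi * yi
    return s

def innerprod2(x, y):
    """Inner product unrolled by four: s1..s4 do not depend on each other."""
    n = len(x)
    s1 = s2 = s3 = s4 = 0.0
    for j in range(n // 4):
        k = 4 * j
        s1 += x[k] * y[k]
        s2 += x[k + 1] * y[k + 1]
        s3 += x[k + 2] * y[k + 2]
        s4 += x[k + 3] * y[k + 3]
    s = s1 + s2 + s3 + s4
    for i in range(4 * (n // 4), n):   # leftover entries when 4 does not divide n
        s += x[i] * y[i]
    return s

x = [float(i) for i in range(1, 11)]
print(innerprod(x, x), innerprod2(x, x))
```

The two functions add the products in a different order, so for general floating point data their results may differ slightly because of roundoff.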

Algorithm 3 Inner product and loop unrolling

1 function innerprod(x, y)
2     s ← 0
3     for i = 1, 2, ..., n
4         s ← s + x_i y_i
5     end for
6     return s
7 end function

1 function innerprod2(x, y)
2     s_1, s_2, s_3, s_4 ← 0
3     for j = 0, 1, 2, ..., ⌊n/4⌋ − 1
4         s_1 ← s_1 + x_{4j+1} y_{4j+1}
5         s_2 ← s_2 + x_{4j+2} y_{4j+2}
6         s_3 ← s_3 + x_{4j+3} y_{4j+3}
7         s_4 ← s_4 + x_{4j+4} y_{4j+4}
8     end for
9     s ← s_1 + s_2 + s_3 + s_4
10    for i = 4⌊n/4⌋ + 1, ..., n
11        s ← s + x_i y_i
12    end for
13    return s
14 end function


Parallelism at the level of using separate cores is often called fine-grained
parallelism. This is often best achieved using a thread-based method such as
PThreads or OpenMP in C or C++ [49, 55, 175]. For matrix operations, PLAPACK
(1997) also exploits thread-level parallelism [250]. Other languages, such as Python
and Julia, have methods of creating and running threads in parallel, whether or not
they make use of additional hardware, or use time sharing to share the same
hardware. Threads are often used for purposes other than improving overall
computational speed. For example, one thread in a program may be performing
computations, while another thread is handling user interaction, so as to avoid
situations where a program appears “dead” while focusing solely on computation.

Multiple cores on a single CPU often have separate level-1 caches while sharing 
common level-2 and higher level caches. Thus multiple cores ultimately share the 
same memory. Threads reflect this: while separate threads have separate system 
stacks, the data and code accessed belong to the same memory system. This is called 
shared-memory parallelism. Thus, there can be conflicts between threads due to 
separate threads trying to simultaneously write to the same memory location, or one 
thread might be reading from a location in memory while another thread is writing to 
that same location. Thus, it is easy to corrupt data in thread-based systems. To prevent 
this, programmers use a combination of software and hardware facilities to implement 
locks and semaphores for controlling access to common memory areas and common 
variables. Some operations have to be made atomic; an atomic operation is one 
that cannot be interrupted—it cannot be split into parts. Closing and opening locks 
typically need to be atomic operations. It is common in shared-memory parallelism 
to lock access to a common variable, perform an operation with that variable, and
then open (or release) the lock so that other threads can then access that variable. 
OpenMP takes care of most of these details. The trade-off in using OpenMP is that 
it is somewhat restrictive in terms of the types of parallelism that can be achieved. 
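
The lock, operate, release pattern described above can be sketched with Python's standard `threading` module. The function name `locked_count` and its parameters are our own; the shared variable is a simple counter that several threads increment.

```python
import threading

def locked_count(num_threads=4, increments=10_000):
    """Increment one shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:          # close the lock; it is released on leaving the block
                counter += 1    # read-modify-write on the shared variable

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                # wait for every thread to finish
    return counter

print(locked_count())           # 40000
```

With the lock in place every increment is applied exactly once, so the final count is the number of threads times the number of increments per thread.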

There are generally two to eight cores on most CPUs at time of writing. These 
numbers will undoubtedly increase over time. On the other hand, Graphical Process- 
ing Units (GPUs) typically have hundreds or even thousands of cores on a single 
unit. 

As an example of how shared-memory parallelism can be exploited, consider the 
recursive summation function in Algorithm 2. This can be turned into an efficient 
shared-memory parallel function for summation. The parallel sections are marked by 
parallel ... end parallel; each statement in the parallel block of code 
is meant to run on a separate thread in parallel. 

Algorithm 4 Parallel summation

function sum_parallel(a, i, j)
    if i > j: return 0
    else if i = j: return a_i
    else
        m ← ⌊(i + j)/2⌋
        parallel
            s_1 ← sum_parallel(a, i, m)
            s_2 ← sum_parallel(a, m + 1, j)
        end parallel
        return s_1 + s_2
    end if
end function


This approach to parallel summation is called parallel reduction, and is
applicable to many other operations besides addition. Parallel reduction can be
applied to any operation that is associative: $(a * b) * c = a * (b * c)$. Parallel
reduction can compute $a_1 * a_2 * \cdots * a_n$ in $\lceil \log_2 n \rceil$ parallel “$*$”
operations assuming an unlimited number of parallel threads.
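
The level count for parallel reduction can be seen in a small Python sketch. It is a level-by-level variant rather than the recursive form of the parallel summation algorithm: at each level neighbouring pairs are combined independently, here through a thread pool from the standard `concurrent.futures` module. The function name `parallel_reduce` is ours. In the standard CPython interpreter threads typically do not speed up pure-Python arithmetic, so this shows the structure of the computation, not the speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
import operator

def parallel_reduce(op, values, max_workers=4):
    """Reduce a non-empty sequence with an associative op, level by level.

    Returns (result, number_of_levels). The order of the operands is
    preserved, so op need not be commutative.
    """
    items = list(values)
    if not items:
        raise ValueError("parallel_reduce needs at least one value")
    levels = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while len(items) > 1:
            lefts = items[0::2]
            rights = items[1::2]
            # All pairs at this level are independent of one another.
            combined = list(pool.map(op, lefts, rights))
            if len(lefts) > len(rights):
                combined.append(lefts[-1])    # unpaired last item moves up a level
            items = combined
            levels += 1
    return items[0], levels

print(parallel_reduce(operator.add, range(1, 11)))   # (55, 4)
```

For ten values the sketch uses four levels, in agreement with $\lceil \log_2 10 \rceil = 4$.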

Using multiple CPUs means that memory is not shared; this is distributed memory 
parallelism. Data is passed between CPUs by passing messages over a communi- 
cations network. This might be a dedicated network or even the Internet. The best 
known system for implementing distributed memory parallelism is the Message 
Passing Interface (MPI) [113]. MPI was designed for use with Fortran and C/C++,
but most languages (including Java, Julia, Python, and R) have libraries with MPI 
bindings. Julia comes with modules for distributed memory parallelism. 

In distributed memory parallelism, there is a cost of moving data that does not 
occur in shared-memory parallelism. However, distributed processing removes mem- 
ory access congestion that often occurs in shared memory parallelism. Separate 
level-1 cache for cores relieves memory access congestion to some extent for shared- 
memory parallelism, but to obtain very high speed-ups, distributed memory paral- 
lelism is necessary. 

To obtain best performance, shared and distributed parallelism need to be com- 
bined. MPI-3 has features to do this; otherwise, OpenMP and MPI should be com- 
bined. 


1.1.7 BLAS and LAPACK 


Linear algebra tends to dominate much of numerical analysis. Large problems usu- 
ally involve large matrices. So doing matrix computations fast has been a focus of 
numerical analysts for many years. Basic Linear Algebra Subprograms (BLAS) was 
first published in 1979 [157], extended to BLAS-2 [81] in 1988 and BLAS-3 [79] 
in 1990. While BLAS concerns itself with operations on vectors, BLAS-2 concerns 
matrix—vector operations, and BLAS-3 concerns matrix—matrix operations. The level 
of the BLAS (indicated by the “-2” or “-3”) indicates the depth of nested loops needed
for a naive implementation. 

Thus, BLAS (or BLAS-1) deals with vector addition, scalar multiplication of 
vectors, and dot products and lengths of vectors; BLAS-2 deals with matrix— 
vector products, matrix—transpose—vector products, solving triangular linear systems 
(see Section 2.1); BLAS-3 deals with matrix—matrix products including matrix— 
transpose—matrix products, and solving systems of the form TX = Y for X where 
T is a triangular matrix and Y is a matrix right-hand side. 

The LINPACK numerical package [82] developed in the 1970s and early 1980s, 
targeting mainly supercomputers, used the BLAS-1 library. The follow-up package 
LAPACK [5] (1992) used all levels of BLAS, with particular focus on using BLAS-3 
with block algorithms wherever possible. 

To understand the performance benefit of using blocked algorithms, consider three 
operations from the different levels of BLAS: dot products from BLAS-1, matrix— 
vector products from BLAS-2, and matrix—matrix products from BLAS-3. Modern 
processors can perform floating point operations far faster than data can be transferred
from main memory. So it is important to re-use data brought in from main memory 
as much as possible. Table 1.1.2 shows the number of data items to be transferred, 
the number of floating point operations, and the ratio of the two for n-dimensional 
vectors and n x n matrices. 

As can be seen in Table 1.1.2, the highest ratio of flops to data transfer occurs 
with BLAS-3 operations. While it can be tempting to think that we can make the 
value of n large for even better performance with BLAS-3 operations, when the 
data needed to do the operations overflows the CPU cache, the benefit is lost (see 
thrashing in Section 1.1.2.3). LAPACK makes full use of these operations by using 
block operations on b x b matrices, where b is chosen so that the blocks fit in cache 
memory. 
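
The ratios in Table 1.1.2 are easy to tabulate for a particular n. The following Python sketch simply evaluates the flop and data transfer counts given in the table; the function name `blas_ratios` is ours.

```python
def blas_ratios(n):
    """Flops, data items transferred, and flops/data, per Table 1.1.2."""
    counts = {
        "BLAS-1 dot product": (2 * n, 2 * n),
        "BLAS-2 matrix-vector product": (2 * n**2, n**2 + 2 * n),
        "BLAS-3 matrix-matrix product": (2 * n**3, 2 * n**2),
    }
    return {name: (flops, data, flops / data)
            for name, (flops, data) in counts.items()}

for name, (flops, data, ratio) in blas_ratios(1000).items():
    print(f"{name}: flops={flops}, data={data}, ratio={ratio:.3f}")
```

For n = 1000 the BLAS-3 ratio is 1000 while the BLAS-1 and BLAS-2 ratios stay at 1 and just under 2, which is why reusing data within cache-sized blocks pays off.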

BLAS implementations are available for various computer architectures as well
as a reference implementation available from netlib [80]. The reference
implementation should only be used to check correctness of a BLAS implementation.
Vendor-provided BLAS tends to provide optimal performance for their architecture, 
such as Intel’s Mathematics Kernel Library (MKL) [132] or nVIDIA’s cuBLAS for 
running on their GPUs. ATLAS (automatically tuned linear algebra software) [259] 
is open-source software that generates partly optimized BLAS implementations in C 
or Fortran by timing various operations to obtain a near-optimal selection of strate- 
gies for improving performance. While not as highly optimized as vendor-provided 
implementations, for new architectures ATLAS quickly provides a BLAS implemen- 
tation that is hard to beat. 


Table 1.1.2 BLAS level, flops, and data transfer

BLAS level   Operation               Flops     Data transfer   Ratio
BLAS-1       dot product             $2n$      $2n$            $1$
BLAS-2       matrix-vector product   $2n^2$    $n^2 + 2n$      $\approx 2$
BLAS-3       matrix-matrix product   $2n^3$    $2n^2$          $n$


Exercises.

(1) Write code for multiplying a pair of n x n matrices (C ← AB):

        for i = 1, 2, ..., n
            for j = 1, 2, ..., n
                c_ij ← 0
                for k = 1, 2, ..., n
                    c_ij ← c_ij + a_ik b_kj
                end for
            end for
        end for

    Do this in your favorite interpreted language (MATLAB, Python, Ruby, ...). Use
    your code for multiplying a pair of 100 x 100 matrices. If your language has a
    built-in operation for performing matrix multiplication, use this to compute the
    matrix product of your test matrices. How much faster is the built-in operation?

(2) The matrix operation C ← C + AB on n x n matrices can be written as three
    nested loops:

        for i = 1, 2, ..., n
            for j = 1, 2, ..., n
                for k = 1, 2, ..., n
                    c_ij ← c_ij + a_ik b_kj
                end for
            end for
        end for

    The order of the loops can be changed; in fact, any order of “for i”, “for j”
    and “for k” can be used giving a total of 3! = 6 possible orderings. In a
    compiled language (such as Fortran, C/C++, Java, or Julia), time the six possible
    orderings. Which one is fastest? Can you explain why?

(3) A rule of thumb for high-performance computing is that when a data item is read
    in, we should use it as much as possible, rather than re-reading that data item
    many times and doing a little computing on it each time. In matrix multiplication,
    this can be achieved by subdividing each matrix into b x b blocks. For n x n
    matrices A and B, we can let m = n/b and write

    $$AB = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1m} \\ A_{21} & A_{22} & \cdots & A_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ A_{m1} & A_{m2} & \cdots & A_{mm} \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} & \cdots & B_{1m} \\ B_{21} & B_{22} & \cdots & B_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ B_{m1} & B_{m2} & \cdots & B_{mm} \end{bmatrix} = \begin{bmatrix} C_{11} & C_{12} & \cdots & C_{1m} \\ C_{21} & C_{22} & \cdots & C_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ C_{m1} & C_{m2} & \cdots & C_{mm} \end{bmatrix}.$$

    Then for $i, j = 1, 2, \ldots, m$, $C_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj}$. As long as we can keep all
    of $A_{ik}$, $B_{kj}$, and $C_{ij}$ in cache memory at once, the update $C_{ij} \leftarrow C_{ij} + A_{ik} B_{kj}$
    can be done without any memory transfers once $A_{ik}$ and $B_{kj}$ have been loaded
    into memory. Write out a blocked version of the matrix multiplication algorithm
    and count the number of memory transfers with the blocked and original matrix
    multiplication algorithms.

(4) Implement the original and unrolled inner product algorithms in Algorithm 3 as
    a function in your favorite programming language. Time the function for taking
    the product of two vectors of 10° entries. Use the simpler version to check the
    correctness of the unrolled version. Note that there may be differences due to
    roundoff error. Put the function call inside a loop to perform the inner product
    10° times. Is there any difference in the times of the two algorithms? Note
    that interpreted languages, such as Python and MATLAB, may see little or no
    difference in timings. This is probably due to the fact that the time savings for
    the unrolled version is negligible compared to the overhead of interpretation.
    Also, interpreted languages will typically “box” and “unbox” the raw floating
    point value, converting it to and from a data structure that contains information
    about the type of the object as well as the object itself.

(5) The following pseudo-code is designed to provoke a “stack overflow” error:

        function overflow(n)
            if n = 2^k for some k
                print(n)
            end if
            overflow(n + 1)
        end function

    Implement in your favorite programming language. How does it behave on your
    computer? How large a value of n is printed out before overflow occurs?

(6) Dynamic memory allocation is memory allocation that occurs at run-time. It is
    essential in all modern computational systems. Pick a programming language.
    How does this language allocate memory or objects? Do you need to explicitly
    de-allocate memory or objects in your programming language? How is this done?

(7) There are different ways of automatically de-allocating unusable objects (garbage
    collection) in programming systems. In this exercise, we look at two of them.
    Describe reference counted garbage collection and “mark and sweep” garbage
    collection. What are their strengths and weaknesses?

(8) Memory leaks are a kind of bug that can be hard to find and remove. These
    arise when memory is allocated for objects that are never de-allocated, and so
    eventually take up all available memory. Even if a system has garbage collection,
    this can still occur. Explain how this might happen.

(9) Memory allocation and de-allocation can lead to fragmentation of the memory
    allocation system over time, so that there may be a great deal of memory available,
    but no large object can be allocated because the available memory is fragmented
    into small pieces. Read about and describe the Smalltalk double indirection
    scheme that allows Smalltalk to de-fragment the memory allocation system.


1.2 Programming Languages 


Algorithms are often described using pseudo-code, but to make them work we need 
to translate these high-level descriptions into a programming language that can be 
executed. MATLAB™ is a language that was developed specifically for numerical
computation. It is a commercial system, although there are publicly available
work-a-like systems such as GNU Octave. Python is a general-purpose programming 
language. The NumPy and SciPy extensions to Python make Python into a suitable 
platform for numerical computation. Julia is also a general-purpose programming 
language, but was developed with numerical computation in mind. C, C++, and Java 
are general-purpose programming languages. C was developed to write the origi- 
nal Unix operating system. C++ was developed as an object-oriented version of C. 
Java was developed to be a write-once-run-anywhere language for web applications 
running on computers, tablets, and smartphones. 


1.2.1 MATLAB™


Here is an example of MATLAB code: 


n = 10;
A = zeros(n,n);
for i = 1:n
    for j = 1:n
        A(i,j) = 1/(i+j-1);
    end
end
rhs = 1 ./ (1:n);
sol = A \ rhs'


This solves a $10 \times 10$ linear system $Ax = b$ where $a_{ij} = 1/(i + j - 1)$ and $b_i = 1/i$.
The exact solution is $x = e_1$, the first standard basis vector, since $b$ is the first
column of $A$.

MATLAB is an interpreted language, but matrix operations are done in C. Orig- 
inally, MATLAB was an interface to LINPACK and EISPACK, which were written 
in Fortran in the 1970s. Gradually all the code was translated into C. 

Originally, MATLAB’s only data structure was the matrix, and even strings were 
represented in terms of matrices. However, as MATLAB developed and its users 
became more sophisticated, more general heterogeneous data structures were incor- 
porated into MATLAB. The basic mechanism for doing this is the “struct”: 
obj = struct('a',37,'b','a string','c',[1 2; 3 -4]);
obj.c


returns a 2 x 2 matrix. Compilers are available for MATLAB, but their use is
uncommon. Three and higher dimensional arrays are now available in MATLAB.

As with most “scripting” languages, the type of a variable is determined by its 
value. There is no way of declaring a particular variable to have a particular
type. The efficiency of MATLAB comes mainly from the fact that underlying matrix 
operations have been implemented in LAPACK and BLAS since 2000. Trying to 
re-create matrix operations by means of for loops is generally very inefficient. 
Optimizing MATLAB code then usually involves “vectorizing” operations so that 
multiple operations can be expressed concisely. 

When passing arrays, MATLAB passes them by reference for efficiency. However, 
MATLAB’s programming model is that of pass-by-value. To reconcile these two 
aspects, MATLAB uses a copy-on-modify rule: if a passed array is modified, a copy 
is created and the copy is modified. In this way, the entries of the passed array are not 
changed and the programmer does not have to be concerned about function arguments 
being modified outside the function. 


1.2.2 Julia 


Here is the corresponding Julia code: 


using LinearAlgebra 

n = 10; 

A = [1/(i+j-1) for i = 1:n, j = 1:n];
rhs = [1/i for i = 1:n]; 

sol =A \ rhs 


Julia is a just-in-time-compiled language, meaning that there is no separate compiler. 
As you enter or load code, it is compiled. This means that some large packages take 
some time to load as this involves compilation. Large packages can be designated to 
be pre-compiled. 

Julia has a sophisticated type system and uses these types to determine which 
function of the correct name to use. This enables Julia users to override the arith- 
metic operations (+, -, *, /) for user-defined types. For example, we could create a 
specialized three-dimensional vector type: 


struct Vec3
    x::Float64   # double precision
    y::Float64
    z::Float64
end


and the usual vector operations, as well as the cross product 
function Base.:+(a::Vec3, b::Vec3)   # extend Base's + rather than shadowing it
    return Vec3(a.x+b.x, a.y+b.y, a.z+b.z)
end


function cross(a::Vec3, b::Vec3) 
return Vec3(a.y*b.z-a.z*b.y, 
a.z*b.x-a.x*b.z, 
a.x*b.y-a.y*b.x) 
end 


When there is a conflict between two different versions of a function whose input 
types match the function call, Julia will pick the more specific one where this can 
be decided. In this way, you can write one version for a specific case that is highly 
optimized for that case, and a more general version that is slower. 

Julia’s numerical matrix computations use BLAS and LAPACK (see Section 1.1.7) 
wherever possible for efficiency. 

Unlike MATLAB, since Julia is compiled as needed, for and while loops are 
reasonably efficient. Julia passes arrays by reference, but Julia has a convention that 
if a function modifies its arguments, the name must have an exclamation point (“!”)
at the end. In this way, programmers are alerted to the possibility that a passed array 
or other object may be modified. 
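
The hazard that this naming convention guards against is not specific to Julia. As an illustration in Python, where lists are also passed by reference, the sketch below contrasts a function that modifies its argument with one that returns a modified copy. The function names are hypothetical ones of ours; Python has no counterpart to the “!” convention.

```python
def scale_in_place(v, factor):
    """Modify the caller's list directly (Julia would name this scale!)."""
    for i in range(len(v)):
        v[i] *= factor

def scaled_copy(v, factor):
    """Leave the argument untouched and return a new list."""
    return [factor * vi for vi in v]

x = [1.0, 2.0, 3.0]
y = scaled_copy(x, 2.0)      # x is unchanged by this call
scale_in_place(x, 10.0)      # x itself is changed by this call
print(x, y)                  # [10.0, 20.0, 30.0] [2.0, 4.0, 6.0]
```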


1.2.3 Python


Here is the corresponding Python code using NumPy for handling matrices and linear 
algebra: 


import numpy as np 

n = 10 

rhs = np.array([1/(i+1) for i in range(n)]) 

A = np.array([[1/(i+j+1) for i in range(n)] for j in range(n)]) 
sol = np.linalg.solve(A, rhs) 

print (sol) 


Python is usually used as an interpreted language, although there are compilers for 
Python. NumPy performs numerical operations by calling C routines for the NumPy 
array operations. 
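
Since the right-hand side is the first column of the matrix, the exact solution of this system is the first standard basis vector. This can be confirmed without any roundoff using Python's standard `fractions` module. The sketch below is our own illustration, not part of NumPy: it performs Gaussian elimination in exact rational arithmetic and assumes that no row exchanges are needed, which holds for this symmetric positive definite matrix.

```python
from fractions import Fraction

def solve_exact(A, b):
    """Solve A x = b by Gaussian elimination in exact rational arithmetic.

    Assumes every pivot is nonzero (no row exchanges are performed).
    """
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for k in range(n):                                # forward elimination
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= factor * M[k][j]
    x = [Fraction(0)] * n
    for i in reversed(range(n)):                      # back substitution
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

n = 10
A = [[Fraction(1, i + j + 1) for j in range(n)] for i in range(n)]
rhs = [Fraction(1, i + 1) for i in range(n)]
sol = solve_exact(A, rhs)
print(sol == [1] + [0] * (n - 1))   # True
```

The floating point solution printed by the NumPy code will generally differ a little from this exact answer, because this particular matrix is badly conditioned.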


1.2.4 C/C++ and Java 


C, C++, and Java are compiled languages, with much of the syntax of these languages 
the same. Java was designed to be run on any platform, and so the output of a Java 
compiler is not machine code, but an intermediate language (Java bytecode), which 
is interpreted by a Java Virtual Machine (JVM). Java bytecode can be compiled, and
some JVMs do compile parts of a running Java program to machine code for greater 
performance. There are a number of packages for numerical computations in each 
of the languages with some free and some commercial. Free libraries include the 
GNU Scientific Library for C, while Boost has numerical components for C++, and 
Colt is a general numerical library for Java. Commercial libraries such as IMSL and
NAG have C bindings although are mostly Fortran underneath. However, there is no 
standard “matrix” data structure for these languages. Bindings to Fortran routines 
for BLAS are available in most libraries. 


1.2.5 Fortran 


Fortran is arguably the oldest generally available compiled programming language, 
with the original Fortran I becoming available on certain IBM 704 computers in 1957 
[14]. Fortran has undergone a number of transformations, but is still in use, and there 
is much code that is written in Fortran. When it was developed, Fortran had to show 
that using a compiler could result in executable code that was about as efficient as 
writing assembly language or machine code. So a focus of Fortran was efficiency. 
Consequently, many aspects of modern programming languages were not present. 
For example, recursion was initially not permitted in Fortran. Other modern program- 
ming language features were slowly added: the Fortran 66 standard, approved in 1966, 
is widely regarded as the starting point; Fortran 77, approved in 1978, allowed if ...
end if blocks and do ... end do loops instead of requiring numeric
line labels to designate the end of an if statement or a loop; Fortran 90, approved 1991, 
allowed free format rather than column oriented input, recursive functions, modules 
for combining related functions and variables, pointers (also known as references), 
dynamic memory allocation, interfaces, array slicing, and operator overloading. In 
short, Fortran 90 caught up with many of the requirements of a modern program- 
ming language. Further extensions were made in subsequent standards: Fortran 95 
incorporated some minor extensions, such as forall for efficient vectorization of
array operations; Fortran 2003 with object-oriented programming support, proce- 
dure (that is, function) pointers, and enhanced module support allowing separation 
of interfaces and implementation; Fortran 2008 provides some additional support 
for parallel computing; finally, Fortran 2018 provides some enhancements regarding 
interoperability with C, and improvements in the parallel computing features. 


real, allocatable, dimension(:,:) :: a, b, c
! missing initialization code to
! allocate a, b, c and initialize a and b
do j = 1, n
  do i = 1, n
    tmp = 0.0
    do k = 1, n
      tmp = tmp + a(i,k) * b(k,j)
    enddo
    c(i,j) = tmp
  enddo
enddo


Standard components of modern scientific computing, such as BLAS and LAPACK, 
are written in Fortran. In fact, LAPACK was written in Fortran 77 until 2008, when 
it was translated into Fortran 90. While Fortran has long been regarded as “old fashioned” (even
back in 1968), it is still an important language, especially for scientific and numer- 
ical computing. Modern versions of Fortran provide many of the features expected 
of modern programming languages, although many of these are “retrofitted”, rather 
than part of the initial design. 


Exercises. 


(1) Pick your favorite programming language. Is it interpreted? Is it compiled? What 
is (or are) the data type(s) for floating point numbers? Is there a built-in data 
type for vectors or matrices? Can the vectors (or matrices) be added? Can they 
be multiplied? 

(2) What facilities are included in your favorite programming language for parallel 
programming? Are there libraries that provide these facilities? Is it possible 
to perform both shared memory and distributed memory computing in your 
language? With these libraries? 

(3) Is your favorite programming language garbage collected? That is, is there auto- 
matic de-allocation of unusable objects? What are the advantages and disadvan- 
tages of garbage collection? 

(4) The following function applies the function f to each item of a vector of items. 
Implement it in your favorite programming language. 


function map(f, x)
    y ← new array of same size as x
    for i an index of x
        y_i ← f(x_i)
    end for
    return y
end function


Now implement it in your least favorite, or a previously unknown, programming 
language. 

(5) Macros are pieces of code that transform other pieces of code before compilation 
or execution. Does your favorite programming language have macros? What 
kinds of transformations can the macros in that language perform? If your favorite 
programming language does not have macros, or you discover another language 
that does have macros (such as Lisp or Julia), describe the macro facilities of 
that language. 
(6) The C programming language was developed for implementing the Unix operating
system. This led to many design decisions by the creators of the language,
such as pointer arithmetic, and lack of array bounds checking. Explain why these
decisions may have been necessary for their purpose at that time, and whether you
think those decisions help or hinder the creation of mathematical software now.

(7) Some languages are stack languages, such as Forth, Joy, and PostScript. Arguments
for a “function” in a stack language do not have names, but the kth input
is the kth item from the top of the stack. Download one of these languages and
implement the “map” function of Exercise 4 in that language.


1.3 Floating Point Arithmetic


Real numbers cannot be stored in a computer, because storing a real number, like

\[ \pi = 3.1415926\ldots, \]

would take an infinite amount of memory. So computers and calculators store just
a finite number of digits or bits after the decimal point. Computer scientists and
mathematicians have worked since the dawn of electronic computers to find efficient
ways to accurately represent, or approximate, real numbers. The common idea is
to use floating point arithmetic. A floating point number consists of three parts: a
sign ($\pm$), a mantissa or significand (where most of the digits or bits of interest are),
and an exponent. Floating point numbers have a base, which is typically 2 or 10 but
was sometimes 16 or something different. For floating point numbers in base $b$ we
represent a number $x$ as

\[ x = \pm (d_0.d_1 d_2 \cdots d_m)_b \times b^e, \]

where $e$ is the exponent and $(d_0.d_1 d_2 \ldots d_m)_b = d_0 + d_1/b + d_2/b^2 + \cdots + d_m/b^m$
is the mantissa.
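
As a small illustration of this representation in base 2, the following Python sketch splits a double precision number into sign, mantissa, and exponent. The helper name `decompose` is hypothetical (it is not from the text), and the sketch assumes a nonzero finite input.

```python
import math

def decompose(x: float):
    """Return (sign, mantissa, exponent) with x = sign * mantissa * 2**exponent
    and 1 <= mantissa < 2, for a nonzero finite float x."""
    if x == 0.0 or not math.isfinite(x):
        raise ValueError("expects a nonzero finite float")
    sign = -1 if x < 0 else 1
    m, e = math.frexp(abs(x))   # abs(x) = m * 2**e with 0.5 <= m < 1
    return sign, 2.0 * m, e - 1  # rescale so that the leading digit d_0 is 1

print(decompose(-6.5))  # (-1, 1.625, 2), since -6.5 = -(1.625) * 2**2
```

Here the mantissa 1.625 is $(1.101)_2$, so $-6.5 = -(1.101)_2 \times 2^2$.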

Before 1985, different manufacturers of computers and computer hardware used 
different formats for storing floating point numbers, making it difficult to use pro- 
grams written for one computer system on another. It also made it difficult to reason 
about how software operating on floating point numbers should behave. In 1985, the
Institute of Electrical and Electronics Engineers (IEEE) adopted the IEEE 754 standard
for floating point arithmetic, which has become the main standard used for floating
point arithmetic.


1.3.1 The IEEE Standards 


In 1985, the IEEE Standard for Floating-Point Arithmetic (IEEE 754) was published
by the Institute of Electrical and Electronics Engineers (IEEE). It represented the
culmination of collaboration between industry, academics, and engineers [128]. Since
then, most floating point computations have been carried out through the IEEE
standards. Before the IEEE 754 standards, there were many different and incompatible
floating point systems. Trying to write software to work under the different formats
and systems was challenging. The IEEE standards completely changed this, but not
by aiming for the lowest common denominator. The IEEE standards set a new bar
for how floating point systems should work, with support for gradual underflow,
correctly rounded arithmetic, rounding modes, and with values for “infinity” and
“Not-a-Number”. More extensive discussions on floating point arithmetic can be
found in [102, 194].


single precision (32 bits):    ± | a1 a2 a3 ... a8  (exponent) | b1 b2 b3 b4 b5 ... b22 b23 (significand)
double precision (64 bits):    ± | a1 a2 a3 ... a11 (exponent) | b1 b2 b3 b4 b5 ... b51 b52 (significand)
extended precision (80 bits):  ± | a1 a2 a3 ... a15 (exponent) | b1 b2 b3 b4 b5 ... b62 b63 (significand) | x

Fig. 1.3.1 IEEE 754: Floating point standards

The IEEE standards were actually three different standards for single precision, 
double precision, and extended precision floating point arithmetic. The three stan- 
dards use binary representation of numbers (base 2), but use different numbers of bits 
for the different components as well as different numbers of bits for the entire number. 
The important components for a floating point number are: the sign bit, the exponent, 
and the significand. The sizes of these components are shown in Figure 1.3.1. In what
follows, we let $M$ be the number of bits in the exponent field, and $N$ the number of
bits in the significand.

The leading bit is the sign bit which is zero for “+” and one for “−”. The
exponent field does not directly represent the exponent. Instead, the exponent is
$e = (a_1 a_2 \ldots a_M)_2 - E_0$ where $E_0$ is chosen to give a balance between positive and
negative exponents: $E_0 = 2^{M-1} - 1$. This gives values of $e$ ranging from $-E_0 =
1 - 2^{M-1}$ to $(2^M - 1) - E_0 = +2^{M-1} = E_0 + 1$. Provided $-E_0 < e < E_0 + 1$, the
number represented is

\[ \pm (1.b_1 b_2 \ldots b_N)_2 \times 2^e. \]

This is a normalized floating point number. Unnormalized floating point numbers
have a lead digit (or bit) that is zero. An advantage of working in binary is that a
normalized floating point number must start with a “1”. Unnormalized floating point
numbers can be normalized by shifting the bits in $b_1 b_2 \ldots b_N$ left and adjusting the
exponent down to compensate, until the leading bit is not zero (and must therefore
be 1). The only exceptions to being able to normalize an unnormalized floating point
number are if $b_1 b_2 \ldots b_N = 00\ldots0$, that is, the number is exactly zero, or if the exponent
field hits its minimum value.
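
The field layout can be inspected directly. The following Python sketch, which assumes IEEE double precision ($M = 11$, $N = 52$) and uses the hypothetical helper name `ieee_double_fields`, extracts the sign bit, the exponent field $(a_1 \ldots a_M)_2$, and the significand bits, and then rebuilds a normalized number from them.

```python
import struct

def ieee_double_fields(x: float):
    """Return (sign bit, exponent field, significand field) of a double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # the 64 bits as an integer
    sign = bits >> 63
    exponent_field = (bits >> 52) & 0x7FF                # M = 11 exponent bits
    significand_field = bits & ((1 << 52) - 1)           # N = 52 significand bits
    return sign, exponent_field, significand_field

E0 = 2 ** (11 - 1) - 1                                   # exponent offset, 1023

sign, a, b = ieee_double_fields(-6.5)
e = a - E0
# Normalized case: value = +/-(1.b_1 b_2 ... b_N)_2 * 2**e
value = (-1) ** sign * (1 + b / 2 ** 52) * 2.0 ** e
print(sign, e, value)  # 1 2 -6.5
```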

If $e = -E_0$, in which case $(a_1 a_2 \ldots a_M)_2 = (00\ldots0)_2 = 0$, the number is a
denormalized floating point number

\[ \pm (0.b_1 b_2 \ldots b_N)_2 \times 2^{-E_0+1} = \pm (b_1.b_2 \ldots b_N)_2 \times 2^{-E_0}. \]

If $e = E_0 + 1$, in which case $(a_1 a_2 \ldots a_M)_2 = (11\ldots1)_2 = 2^M - 1$, the quantity
represented is either infinity (denoted “Inf”) or a NaN (“not-a-number”). Infinity can
be signed, depending on the sign bit, so $+$Inf and $-$Inf are distinct possible values. The
quantity represented is $\pm$Inf if $b_1 b_2 \ldots b_N = 00\ldots0$; otherwise, the quantity
represented is a NaN, which is best understood as “undefined”. Any arithmetic
operation or function evaluation with a NaN results in NaN. Any ordering or equality
comparison of a NaN with any floating point number gives the value “false”. That is,
if $x$ is a NaN then even the equality test “$x = x$” will return false. This gives a quick
way of identifying if a quantity is a NaN.

There are many ways of creating “Inf” and “NaN”. For creating “Inf” we can
use division by zero (such as 1.0/0.0), very large values (such as exp(BigNumber)
or VeryBigNumber × AnExtremelyBigNumber), or certain functions applied to +Inf
(exp(Inf), Inf$^2$, log(Inf), or $\sqrt{\mathrm{Inf}}$). NaN can be created by zero on zero (0.0/0.0),
certain function values (sin(Inf) or cot(0)), or operations on “infinity” (Inf − Inf,
Inf/Inf, or $1^{\mathrm{Inf}}$).

Any computation that gives a floating point result that is too small in magnitude
to be a normalized floating point number is called underflow. If the result is a
denormalized number that is not actually zero, then it is called gradual underflow. If a
computation gives a floating point result that is too large to be normalized (so that the
result is $\pm$Inf or NaN), we say that there has been overflow. There is nothing gradual
about overflow.

NaNs are to be avoided wherever possible. Once created, they tend to propagate:
anything a NaN touches becomes NaN. Even $0 \times \mathrm{NaN}$ is NaN. In the time before IEEE
arithmetic, what happened instead of generating this symbolically undefined quan- 
tity is that a program in that situation would crash: the program would terminate, 
hopefully generating a useful error message about what went wrong and where it 
happened. Now, a program will happily continue past the point at which a NaN is 
generated, probably using that value multiple times resulting in many NaNs in your 
results. You can expect to then have a printout of results consisting mostly, if not 
entirely, of NaNs. You will look at this useless output and ask “What went wrong?” 
and “Where did it go wrong?” The NaNs will not tell you, unless you put tests in 
your code to declare an emergency as soon as a NaN is found. 
NaNs also have special properties regarding comparisons: any comparison with
a NaN is false. So, for example, $\mathrm{NaN}^2 \ge 0$ is always false, as is $\mathrm{NaN} \le \mathrm{Inf}$. Most
importantly $\mathrm{NaN} = \mathrm{NaN}$ is also always false. Thus, a variable $x$ can be tested to be
a NaN by checking if $x = x$; if $x$ is not a NaN then the result is true, but if $x$ is a
NaN, then the result is false. In this way, NaNs can be identified.
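
The following Python sketch illustrates these properties. It is only an illustration: the helper name `is_nan` is hypothetical, and Python itself raises an exception for 1.0/0.0 rather than returning Inf, so the sketch creates Inf and NaN by other routes.

```python
import math

inf = float("inf")
nan = inf - inf            # Inf - Inf gives NaN

def is_nan(x: float) -> bool:
    """A NaN is the only floating point value that is not equal to itself."""
    return x != x

print(1e308 * 10)          # inf: the product overflows
print(is_nan(nan), math.isnan(nan))  # True True
print(nan == nan)          # False
print(nan * nan >= 0.0)    # False
print(nan <= inf)          # False
print(is_nan(0.0 * nan))   # True: even 0 * NaN is NaN
```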

Apart from debugging purposes, testing explicitly for NaNs in programs is often 
not needed, except perhaps in code for high-reliability applications. It is even better 
to design your code to avoid NaNs. This may not always be possible if your inputs 
include user-defined functions. While it might be hoped that your clients would write 
functions that never generate a NaN, it can happen, so you should be prepared to 
deal with this eventuality. 


1.3.2. Correctly Rounded Arithmetic 


The most important property of IEEE arithmetic is that it is correctly rounded. That
is, provided $x$ and $y$ are floating point numbers and $\circ$ is one of $+$, $-$, $\times$, $/$, the computed
value of $x \circ y$ is the nearest floating point number to the exact value of $x \circ y$. This
means that there are some very useful models of floating point arithmetic that apply 
to IEEE arithmetic, but not necessarily to other implementations of floating point 
arithmetic. To develop these models we need a notation that distinguishes between 
the value of an expression using real numbers, with infinite precision, and the value 
obtained by using a given system of floating point arithmetic. For a given expression 
expr, we use expr to represent the exact value, and fl(expr) to represent the value of 
expr as computed using floating point arithmetic. 

IEEE arithmetic achieves correctly rounded arithmetic for the usual floating point 
operations by using guard digits; these are extra digits (bits, actually) to achieve some 
additional accuracy during the computation, but before rounding, so that the final 
rounded result is correct. IEEE arithmetic, in fact, has several rounding modes, of
which round-to-nearest is the default option just described. Other rounding options 
include round-up, round-down, and round-toward-zero. 

Using the default round-to-nearest mode enables us to give a formal model for
IEEE arithmetic that is not a complete model in that it does not completely specify
the result of a given floating point operation. Many attempts have been made to create
an algebra for floating point operations. However, the resulting algebra must be
non-associative: that is, there are $a$, $b$, and $c$ where $(a + b) + c \ne a + (b + c)$ where
“+” is taken to be floating point addition. To see this, note that in a given floating
point system that is correctly rounded, there must be a positive floating point number
$\delta > 0$ where the computed value $fl(1 + \delta) = 1$; choose $\delta > 0$ to be smaller than half
the distance from one to the next largest floating point number. Then

\[ fl(fl(\delta + 1) + (-1)) = fl(1 + (-1)) = 0 \quad\text{while} \]
\[ fl(\delta + fl(1 + (-1))) = fl(\delta + 0) = \delta \ne 0. \]
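
This failure of associativity is easy to observe. A minimal Python sketch, assuming IEEE double precision (where the distance from one to the next floating point number is $2^{-52}$) and the illustrative choice $\delta = 2^{-54}$:

```python
delta = 2.0 ** -54              # less than half the gap 2**-52 above 1.0

left = (delta + 1.0) + (-1.0)   # fl(delta + 1) = 1, so the result is 0
right = delta + (1.0 + (-1.0))  # fl(1 + (-1)) = 0, so the result is delta

print(left)           # 0.0
print(right)          # 5.551115123125783e-17
print(left == right)  # False
```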
Fig. 1.3.2 Plot of $g(x) = x^5 - 10x^4 + 40x^3 - 80x^2 + 80x - 32$


As another example, to illustrate the difficulty in giving an exact model, consider
the function $g(x) = (x - 2)^5 = x^5 - 10x^4 + 40x^3 - 80x^2 + 80x - 32$. A plot of
values of the expanded expression at closely spaced values of $x$, computed using
MATLAB with IEEE double precision, is shown in Figure 1.3.2. Attempting
to predict exactly what rounding errors will be incurred is extremely difficult and
makes analysis difficult.
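
The effect can be reproduced with a few lines of Python. This is a sketch rather than the computation behind Figure 1.3.2: the grid (spacing $10^{-5}$ around $x = 2$) and the names `g_expanded` and `g_factored` are assumptions made for illustration.

```python
def g_expanded(x: float) -> float:
    # Expanded form: large terms of alternating sign nearly cancel near x = 2.
    return x**5 - 10*x**4 + 40*x**3 - 80*x**2 + 80*x - 32

def g_factored(x: float) -> float:
    # Factored form: only one subtraction, which is accurate near x = 2.
    return (x - 2)**5

xs = [2.0 + k * 1e-5 for k in range(-100, 101)]   # hypothetical grid near x = 2
max_err = max(abs(g_expanded(x) - g_factored(x)) for x in xs)
sign_changes = sum(1 for a, b in zip(xs, xs[1:])
                   if g_expanded(a) * g_expanded(b) < 0)

print(max_err)       # rounding noise, far larger than |x - 2|**5 <= 1e-15 on this grid
print(sign_changes)  # the exact g changes sign only once, at x = 2
```

On this grid the exact values satisfy $|g(x)| \le 10^{-15}$, so nearly all of what the expanded form returns is rounding error.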


1.3.2.1. A Formal Model of Floating Point Arithmetic 


Instead of trying to model the exact behavior, we just try to determine bounds for
the results of a floating point operation. If $z$ is a positive real number in the range
of normalized floating point numbers, from $2^{-E_0+1}$ to $2^{+E_0}$, then the closest floating
point number $fl(z)$ to $z$ satisfies

\[ |z - fl(z)| \le u\,|z| \]

for a positive number $u$ called unit roundoff. Unit roundoff can be determined to be
$2^{-N}$ where $N$ is the number of bits in the significand. Values of unit roundoff for the
three IEEE standards, along with the approximate ranges of the three standards, are
shown in Table 1.3.1.

Note that the absolute error in a computed quantity $\hat{z}$ with exact value $z$ is just the
difference $|z - \hat{z}|$, while the relative error is $|z - \hat{z}| / |z|$. So if $z$ is in the range of
normalized floating point numbers, then the absolute error in $fl(z)$ is $\le u\,|z|$, while the
relative error is $\le u$.
Table 1.3.1 Parameters of floating point arithmetic

         | unit roundoff ($u$)                    | denormalized: smallest number ($f_{\min}$)     | normalized: smallest                         | normalized: largest
single   | $2^{-23} \approx 1.2 \times 10^{-7}$   | $2^{-148} \approx 2.8 \times 10^{-45}$         | $2^{-126} \approx 1.2 \times 10^{-38}$       | $2^{128} \approx 3.4 \times 10^{38}$
double   | $2^{-52} \approx 2.2 \times 10^{-16}$  | $2^{-1072} \approx 2.0 \times 10^{-323}$       | $2^{-1022} \approx 2.2 \times 10^{-308}$     | $2^{1024} \approx 1.8 \times 10^{308}$
extended | $2^{-63} \approx 1.1 \times 10^{-19}$  | $2^{-16447} \approx 9.1 \times 10^{-4952}$     | $2^{-16382} \approx 3.4 \times 10^{-4932}$   | $2^{32768} \approx 1.4 \times 10^{9864}$


This leads to the following model of floating point arithmetic: for floating point
numbers $x$ and $y$ and operations $\ast = +, -, \times, /$,

\[ fl(x \ast y) = (x \ast y)(1 + \epsilon) \quad\text{for some } |\epsilon| \le u, \tag{1.3.1} \]

provided there is no underflow (including gradual underflow) or overflow. Again, $u$
is unit roundoff. To expand this to allow for underflow, including gradual underflow,
we need an additional parameter $f_{\min}$, the smallest positive floating point number:

\[ fl(x \ast y) = (x \ast y)(1 + \epsilon) + \eta \quad\text{for some } |\epsilon| \le u,\ |\eta| \le f_{\min}. \tag{1.3.2} \]
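
Model (1.3.1) can be checked empirically by comparing each floating point result against exact rational arithmetic. The sketch below assumes IEEE double precision with $u = 2^{-52}$ as in Table 1.3.1; the sampling ranges and the helper name `relative_error` are arbitrary illustrative choices that avoid overflow, underflow, and a zero exact result.

```python
from fractions import Fraction
import operator
import random

u = Fraction(1, 2 ** 52)    # unit roundoff for double precision (Table 1.3.1)

def relative_error(op, x: float, y: float) -> Fraction:
    """|fl(x op y) - (x op y)| / |x op y|, evaluated in exact rational arithmetic."""
    exact = op(Fraction(x), Fraction(y))   # floats convert to Fractions exactly
    computed = Fraction(op(x, y))          # the floating point result
    return abs(computed - exact) / abs(exact)

random.seed(0)
ops = [operator.add, operator.sub, operator.mul, operator.truediv]
worst = Fraction(0)
for _ in range(1000):
    x, y = random.uniform(1.0, 2.0), random.uniform(3.0, 4.0)
    for op in ops:
        worst = max(worst, relative_error(op, x, y))

print(float(worst), float(u))   # the worst observed |epsilon| stays below u
```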


More general functions can be correctly rounded, although this is much harder to 
achieve. One reason is that to determine the correct rounding of a function value f (x) 
may require knowing the value of f(x) to much higher precision than the floating 
point arithmetic provides if f(x) is close to the midpoint between two adjacent 
floating point numbers. Providing a formal model for function evaluation is more 
complex, and the error behavior of a function can depend very much on how the 
function is implemented. We do expect that built-in functions (such as exp, log, 
square roots, sin, cos, tan) are well implemented and have an error behavior as good 
as can be expected. A suitable model for a well-implemented function is 


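\[ fl(f(x)) = f((1 + \epsilon_1)x)\,(1 + \epsilon_2) \quad\text{for some } |\epsilon_1|, |\epsilon_2| \le u \tag{1.3.3} \]

provided there is no underflow (including gradual underflow) or overflow. Incorporating
underflow requires the $f_{\min}$ parameter:

\[ fl(f(x)) = f((1 + \epsilon_1)x + \eta_1)\,(1 + \epsilon_2) + \eta_2 \quad\text{for some } |\epsilon_1|, |\epsilon_2| \le u,\ |\eta_1|, |\eta_2| \le f_{\min}. \tag{1.3.4} \]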


A reason that we cannot use $fl(f(x)) = f(x)(1 + \epsilon)$ with $|\epsilon| \le u$ alone is that if
$f(x) \approx 0$ and $x \approx 1$ there can still be an error of size $u$ from preliminary operations
on the input $x$. Consider, for example, $f(x) = \sin(\pi x)$ implemented as sin(pi*x)
in your favorite programming language. (In Java, use Math.sin(Math.PI*x).) Using
the built-in sin function and the built-in value of $\pi$, this should be considered
“well implemented”. Yet, $f(1) = \sin\pi = 0$ while its computed value in MATLAB
is $fl(f(1)) \approx 1.22 \times 10^{-16}$ which is not $f(1)(1 + \epsilon)$ for any $\epsilon$. Of course,
the numerical value of $\pi$ used is $fl(\pi) \ne \pi$ and the difference $|fl(\pi) - \pi| \le
u\pi \approx 6.91 \times 10^{-16}$. This explains the error in $f(1)$. We can, however, represent
$fl(f(1)) = f(1\cdot(1 + \epsilon_1))\,(1 + \epsilon_2)$ for some $|\epsilon_1|, |\epsilon_2| \le u$. In this case, $\epsilon_1$ applied to
the input is more important than $\epsilon_2$ applied to the output.
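
The same behavior appears in Python, whose `math.sin` and `math.pi` play the role of the built-in function and constant. This is a sketch that assumes IEEE double precision and a reasonably accurate math library; the function name `f` follows the text.

```python
import math

def f(x: float) -> float:
    return math.sin(math.pi * x)

u = 2.0 ** -52
computed = f(1.0)

print(computed)                        # small but nonzero, about 1.22e-16
print(computed != 0.0)                 # True: not f(1)(1 + eps) for any eps, as f(1) = 0
print(abs(computed) <= u * math.pi)    # True: consistent with |fl(pi) - pi| <= u * pi
```

Since $\sin(\pi - t) = \sin t \approx t$ for small $t$, the computed value is essentially $\pi - fl(\pi)$, the rounding error in the stored value of $\pi$.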

In the next section (Section 1.4), we will apply this formal model to understand 
how things sometimes go wrong. 

For some computations, floating point arithmetic is exact: adding zero, and mul- 
tiplying by a power of two (provided neither overflow nor gradual underflow occur). 
Multiplying by a power of two generally only results in a change in the exponent. 
Addition, subtraction, and multiplication of modest sized integers are also exact: as 
long as the number of non-zero bits of the inputs and the output in binary is less than 
the number of bits of the significand, the results are computed exactly. 
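
A few of these exact cases can be confirmed in Python, assuming IEEE double precision (53 significant bits including the leading 1). The specific numbers are illustrative choices.

```python
x = 0.1

# Multiplying or dividing by a power of two only changes the exponent.
print((x * 2.0 ** 10) / 2.0 ** 10 == x)       # True

# Adding zero is exact.
print(x + 0.0 == x)                           # True

# Products of modest integers are exact while the result fits in the significand.
a, b = 2 ** 26 + 1, 2 ** 26 - 1               # a * b = 2**52 - 1
print(float(a) * float(b) == float(a * b))    # True

# Beyond that, integers are no longer all representable.
big = 2 ** 53 + 1
print(int(float(big)) == big)                 # False
```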


1.3.3 Future of Floating Point Arithmetic 


While IEEE floating point arithmetic is the mainstay of modern numerical compu- 
tation, there are a number of alternatives going in different directions: 


Greater accuracy: variable-size arithmetic, quad precision, software defined pre- 
cision, and “double double” arithmetic. 

Greater speed: often lower accuracy but with fewer bits and therefore faster to 
move the data. 

Greater reliability: interval arithmetic, which gives guaranteed bounds on the 
results. 


There are always new proposals coming forward, so there is little hope of any descrip- 
tion remaining comprehensive. However, there are some common threads to the 
alternatives to floating point arithmetic. We will look at the different themes in turn. 


1.3.3.1 Greater Accuracy 


There are many ways of gaining greater accuracy. Fortran has long had its own
quad precision type for use where double precision was not sufficient. C and
C++ have a long double type that is often, but not always, implemented by
the IEEE extended precision standard. The IEEE 754-2008 revision of the IEEE
standards allows for a 128-bit quad precision arithmetic. However, this has not been
implemented in commonly available hardware, so software implementation must
be used. This makes the quad precision arithmetic very slow in comparison with
double or extended precision. There are also “double double” and “quad-double”
packages that leverage the hardware advantage of using a pair (or four) of IEEE
double precision numbers to create a virtual quad precision system with 106 bits of
mantissa (105 bits of significand) [124].
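
The building block behind “double double” arithmetic is an error-free transformation: the rounding error of a floating point sum is itself a floating point number and can be recovered exactly. The sketch below uses Knuth's TwoSum, a standard technique; it is not necessarily how the packages in [124] are implemented, and the name `two_sum` is an assumption.

```python
from fractions import Fraction

def two_sum(a: float, b: float):
    """Return (s, err) with s = fl(a + b) and s + err = a + b exactly
    (assuming no overflow)."""
    s = a + b
    b_virtual = s - a            # the part of s that came from b
    a_virtual = s - b_virtual    # the part of s that came from a
    err = (a - a_virtual) + (b - b_virtual)
    return s, err

s, err = two_sum(0.1, 0.2)
print(s, err)
# The pair (s, err) carries the exact sum of the two doubles:
print(Fraction(s) + Fraction(err) == Fraction(0.1) + Fraction(0.2))  # True
```

A double double number is then an unevaluated pair such as `(s, err)`, with the second component holding the bits that do not fit in the first.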

There are pure software systems for arbitrary precision arithmetic, including the 
GNU Scientific Library [107], as well as “BigFloat” or “BigNum” systems in Java, 
Julia, Python, and other languages. As with other software implementations of float- 
ing point arithmetic, these are relatively slow. Used strategically, though, they can 
be very useful. 

There are more adventurous proposals that aim for efficient and hardware- 
implementable floating point systems that try to make better use of the bits that 
are available. The IEEE standards have a fixed exponent size. This means that quan- 
tities with values closer to one have the same number of significand bits as much 
larger or much smaller quantities. Since most computation uses quantities that are 
closer to one, it is natural to try to allow the number of exponent bits to vary so that 
these quantities have smaller exponent fields and larger significand fields. Variable 
exponent size is a challenge for hardware implementation and harder to analyze as 
there is no longer a single “unit roundoff” to estimate errors. Various “flexible expo- 
nent” or “flexible range” forms of floating point arithmetic have been proposed, such 
as Clenshaw and Olver’s level-index arithmetic (see [52, 53]) which uses an iterated 
logarithm scheme to represent numbers over a very wide range. A more recent pro- 
posal is Gustafson’s unum idea [115], which is a simpler “flexible exponent” idea 
than level-index arithmetic. For most high-performance computing needs, there has 
to be a fixed number of bits, so that individual entries in an array can be accessed at 
will. Whatever the number of bits, there is a finite range; overflow and underflow are 
still possible. The higher complexity of these systems means that the basic model of
“correctly rounded” floating point arithmetic has to be replaced by a more sophisti- 
cated, but less understandable, model of floating point error. In short: each system 
has its limits. 


1.3.3.2 Greater Speed 


There are developments in the opposite direction: lower precision for greater speed. 
Applications in signal processing, graphics, and machine learning have amplified the 
desire for greater speed. These applications typically have lower accuracy require- 
ments than most scientific computation tasks. Uncompressed graphic and audio data 
streams typically have 1 byte per audio sample and 1 byte per color stream per pixel.
Machine learning applications often use low-precision computations and stochastic 
algorithms where high precision is not useful. 

This has led to the use of half precision (16 bit) floating point numbers. These 
can be transferred roughly twice as fast as single precision (32 bit) and four times 
as fast as double precision (64 bit) floating point numbers. SIMD (single instruction 
multiple data) units for higher throughput can process four half precision numbers 
for each double precision number processed. 
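
Half precision can be explored without special hardware: Python's standard `struct` module supports the IEEE 16-bit format through the `"e"` format code. The helper name `to_half` in this sketch is hypothetical.

```python
import struct

def to_half(x: float) -> float:
    """Round x to the nearest IEEE half precision number, returned as a double."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(struct.calcsize("<e"))    # 2 bytes per number
print(to_half(0.1))             # 0.0999755859375
print(to_half(1.0 + 2 ** -12))  # 1.0: the perturbation is lost in half precision
```

With 10 stored significand bits, the relative error of the half precision value of 0.1 is about $2.4 \times 10^{-4}$, compared with roughly $10^{-17}$ in double precision.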

A further alternative to half precision floating point arithmetic is to use fixed-point
arithmetic. Fixed-point arithmetic is certainly not a new idea; floating point
arithmetic was first developed for computing to overcome the limitations of fixed-point
arithmetic in the 1950s. Beyond this, digital signal processing has long used
fixed-point arithmetic, usually implemented in hardware to achieve sufficient speed.
Half precision floating point arithmetic can be analyzed in the same way as higher
precision floating point systems. Fixed-point arithmetic can be analyzed in similar
ways, often with $u = 0$ and $f_{\min}$ set to the smallest positive fixed-point number.
Appropriate methods of analysis of fixed-point arithmetic depend on the system.


1.3.3.3 Greater Reliability 


The best route to greater reliability for floating point arithmetic is interval arithmetic
[140, 180]. Instead of having a single floating point value, we use intervals
representing a range of possible values. Then we can apply arithmetic operations to
intervals

\begin{align*}
[a,b] + [c,d] &= \{\, x + y \mid x \in [a,b],\ y \in [c,d] \,\} = [a + c,\ b + d], \\
[a,b] - [c,d] &= \{\, x - y \mid x \in [a,b],\ y \in [c,d] \,\} = [a - d,\ b - c], \\
[a,b] \times [c,d] &= \{\, x \cdot y \mid x \in [a,b],\ y \in [c,d] \,\} \\
  &= [\min(ac, ad, bc, bd),\ \max(ac, ad, bc, bd)], \\
[a,b] \,/\, [c,d] &= [a,b] \cdot \frac{1}{[c,d]} \quad\text{with} \\
\frac{1}{[c,d]} &= \begin{cases}
  [\tfrac{1}{d}, \tfrac{1}{c}], & \text{if } 0 \notin [c,d], \\
  [\tfrac{1}{d}, \infty), & \text{if } c = 0, \\
  (-\infty, \tfrac{1}{c}], & \text{if } d = 0, \\
  (-\infty, \tfrac{1}{c}] \cup [\tfrac{1}{d}, \infty), & \text{otherwise.}
\end{cases}
\end{align*}

We can guarantee inclusion if we use the round-down rounding mode for the lower
bound and round-up for the upper bound, and $\pm$Inf where appropriate. This means
that we can guarantee the true value lies in the interval computed.

The computed interval may be much wider than necessary, in which case the result 
of standard floating point computation should be used. This is the basic idea of the 
language triplex Algol 60 [7]. 


Functions can be applied to intervals: for increasing functions such as exp, log, and $\sqrt{\cdot}$,
we have $f([a, b]) = [f(a), f(b)]$. Again, the rounding mode should be set
appropriately according to whether the upper or lower bound is being computed. Functions
that are a mixture of increasing and decreasing such as $x \mapsto x^2$ and trigonometric
functions can also be computed accurately for intervals. Interval computations can
give results that are well beyond the actual bounds of function values. Consider, for
example, computing $\cosh x = \tfrac{1}{2}(e^x + e^{-x})$ with the interval $[-2, +2]$:

\begin{align*}
\cosh([-2, +2]) &\subseteq \tfrac{1}{2}\left(\exp([-2, +2]) + \exp(-[-2, +2])\right) \\
  &= \tfrac{1}{2}\left([e^{-2}, e^{+2}] + [e^{-2}, e^{+2}]\right) = [e^{-2}, e^{+2}] \quad\text{whereas} \\
\cosh([-2, +2]) &= [1, \cosh(2)] = [1, \tfrac{1}{2}(e^{+2} + e^{-2})].
\end{align*}


Interval arithmetic can be a very powerful tool. However, it is best used selectively. 
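
A minimal Python sketch of these ideas follows. It is an illustration under stated assumptions, not a rigorous interval library: the class name `Interval` and the helpers `down` and `up` are hypothetical, Python 3.9 or later is assumed for `math.nextafter`, and directed rounding is imitated by widening each bound by one floating point number (which covers the rounding of the basic operations, and covers `math.exp` only if the library's exp is accurate to within one unit in the last place).

```python
import math

def down(v: float) -> float:
    """Next float below v: outward rounding for a lower bound."""
    return math.nextafter(v, -math.inf)

def up(v: float) -> float:
    """Next float above v: outward rounding for an upper bound."""
    return math.nextafter(v, math.inf)

class Interval:
    def __init__(self, lo: float, hi: float):
        assert lo <= hi
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(down(self.lo + other.lo), up(self.hi + other.hi))

    def __sub__(self, other):
        return Interval(down(self.lo - other.hi), up(self.hi - other.lo))

    def __mul__(self, other):
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(down(min(p)), up(max(p)))

    def __neg__(self):
        return Interval(-self.hi, -self.lo)

    def exp(self):
        # exp is increasing, so f([a, b]) = [f(a), f(b)]
        return Interval(down(math.exp(self.lo)), up(math.exp(self.hi)))

    def __repr__(self):
        return f"[{self.lo!r}, {self.hi!r}]"

X = Interval(-2.0, 2.0)
naive_cosh = Interval(0.5, 0.5) * (X.exp() + (-X).exp())
print(naive_cosh)                          # roughly [0.135, 7.389] = [e^-2, e^+2]
print(Interval(1.0, up(math.cosh(2.0))))   # the true range, roughly [1, 3.762]
```

The naive evaluation is a valid enclosure but much wider than the true range, because the two occurrences of $x$ in $\tfrac{1}{2}(e^x + e^{-x})$ are treated as if they could vary independently.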


Exercises. 


(1) Compute machine epsilon for your computer and programming language: 


x ← 1
while 1 + x > 1
    x ← x/2
end while
x ← 2x


What is machine epsilon for your system? 
(2) Estimate the range of your floating point system: 


x ← 1;  y ← 2x
while y − y = 0    // fails if y = Inf
    x ← y;  y ← 2x
end while
print x


Report the value of $x$ returned. Check that the computed value of $2x$ is Inf. Is
$1.99 \times x$ computed to be Inf as well?
(3) The Taylor series for $e^x$ is $e^x = 1 + x + x^2/2! + x^3/3! + \cdots + x^n/n! + \cdots$. We
approximate $e^x \approx 1 + x + x^2/2! + x^3/3! + \cdots + x^n/n!$ for some fixed $n$:

y ← 1;  term ← x
for k = 1, 2, ..., n+1
    y ← y + term;  term ← term × x/(k + 1)
end for


If $n = 20$ the remainder term of $e^x$ for $|x| \le 1$ is less than unit roundoff for
double precision. Using this approximation for $x = \pm 1$ compute an estimate for
$e^{+1} \times e^{-1}$. If this was done in exact arithmetic you would get exactly one. What
is the difference between the computed value of $e^{+1} \times e^{-1}$ and one with this
method?

(4) Modify the code in the previous question to sum the terms in reverse order, that
is, from highest order terms to lowest order terms. Repeat the test of finding the
computed value of $e^{+1} \times e^{-1} - 1$ using this method. Which has smaller error?

(5) In interval arithmetic, a number $u$ is represented by an interval that contains
the exact value of the number. A vector of two intervals is a rectangle with sides
parallel to the x- and y-axes. Show that the operation applied to a vector of
rectangles $[u, v]^T$ of the same width


a | APES ee] Dole =a 

Beret liebe 
results in a rectangle [x, y]” that has width 2 times the width of [u, v]". 
(6) Compare three ways of summing a large number of roughly equal positive quantities
$\sum_{i=1}^{n} a_i$: (1) sum from the beginning; (2) sum from the end; (3) use the
recursive algorithm Algorithm 2. Give bounds on the error in each of these summation
algorithms of the form $\sum_{i} a_i \eta_i$, where the $\eta_i$ are numbers independent of
the $a_i$. Which algorithm minimizes your bound on $\max_i |\eta_i|$? [Hint: Note that
$(1 + \alpha/(1 - \alpha))(1 + \beta/(1 - \beta)) \le 1 + (\alpha + \beta)/(1 - (\alpha + \beta))$.]
(7) Overflow and underflow can occur for surprisingly modest inputs to the expo- 
nential function. What are the smallest values of x where e* results in either 
overflow or underflow in IEEE single precision, IEEE double precision, and 
IEEE extended precision? 
(8) Interval arithmetic can be used to do more than simply rigorously control roundoff
error. The interval version of Newton’s method for solving $f(x) = 0$ is

\[ \mathbf{x}_{n+1} = \mathbf{x}_n \cap \left( \hat{x}_n - f(\hat{x}_n) / f'(\mathbf{x}_n) \right). \tag{1.3.5} \]


Note that division of intervals [a, b]/[c, d] is a pair of intervals if c <0 <d. 
Implement this method, noting that at each iteration you will need to keep a 
union of intervals rather than a single interval. 

(9) The Table Maker’s dilemma is the problem that determining correctly rounded
values of a transcendental function may require evaluation of that function to far
higher precision. This issue is discussed in [159], where quad precision is used
to resolve these issues, at least for double precision tables. Read this article and
explain how these issues are resolved in [159].


1.4 When Things Go Wrong 


1.4.1 Underflow and Overflow 


Overflow is an all-or-nothing phenomenon. Underflow can be gradual or immediate. 
In diagnosing these problems, we look for results that are Inf, NaN, or zero. 
Fig. 1.4.1 Computation of $\cos\theta$ and $\sin\theta$ given $(x, y)$ coordinates


1.4.1.1 A Trigonometric Computation


Consider, for example, computing $\cos\theta$ and $\sin\theta$ from $(x, y)$ coordinates of a point,
as illustrated in Figure 1.4.1. The formulas we use are

\[ \cos\theta = \frac{x}{\sqrt{x^2 + y^2}}, \tag{1.4.1} \]
\[ \sin\theta = \frac{y}{\sqrt{x^2 + y^2}}. \tag{1.4.2} \]


In MATLAB, using double precision for $x = 10^{200}$ and $y = 10^{200}$ the computed
values of $\cos\theta$ and $\sin\theta$ from (1.4.1, 1.4.2) are both zero. Why? For $\cos\theta =
x/\sqrt{x^2 + y^2}$, we get the computed value of $\sqrt{x^2 + y^2}$ to be Inf. Note that if we
use a sufficiently smaller power of ten for both $x$ and $y$, then $\sqrt{x^2 + y^2}$ is finite and
$x/\sqrt{x^2 + y^2}$ is computed to differ from $1/\sqrt{2}$ by about $1.1 \times 10^{-16}$. So there is a
threshold in the size of $x$ and $y$ where this bad behavior occurs. This indicates overflow.

We can see overflow already in the computation of $x^2$ which evaluates to
$fl(x^2) = \mathrm{Inf}$. To see why, note that $x = 10^{200}$ so $x^2 = (10^{200})^2 = 10^{400}$. However, the
largest number representable by double precision arithmetic is $\approx 1.8 \times 10^{308}$. Clearly
$10^{400}$ is too big, so the computation overflows and gives the value Inf. Similarly
$fl(y^2) = \mathrm{Inf}$. This gives $fl(x^2 + y^2) = \mathrm{Inf} + \mathrm{Inf} = \mathrm{Inf}$. Furthermore, $\sqrt{\mathrm{Inf}} = \mathrm{Inf}$ so
$fl(\sqrt{x^2 + y^2}) = \mathrm{Inf}$. So the final computation is $fl(x/\sqrt{x^2 + y^2}) = fl(10^{200}/\mathrm{Inf}) = 0$
as $10^{200}$, while large, is insignificant next to Inf. The same computations and results
are obtained for computing $\sin\theta$.

If we try the computation with $x = 10^{-200}$ and $y = 10^{-200}$ then we get the
reverse issue: both $\cos\theta = x/\sqrt{x^2 + y^2}$ and $\sin\theta = y/\sqrt{x^2 + y^2}$ are evaluated to
Inf. Again, for $x$ and $y$ equal to a power of ten sufficiently closer to one, the computed
result differs from the correct value $1/\sqrt{2}$ by about $1.1 \times 10^{-16}$. The problem with
$x = 10^{-200}$ and $y = 10^{-200}$ is underflow. We can see this in computing $x^2$ for
$x = 10^{-200}$: this gives zero. To see why, note that for double precision the smallest
positive floating point number is $\approx 2.0 \times 10^{-323}$ while $x^2 = (10^{-200})^2 = 10^{-400}$ is much
smaller. So the computed value must be zero. Then $fl(x^2 + y^2) = fl(x^2) + fl(y^2) = 0 + 0 = 0$.
Then $fl(\sqrt{x^2 + y^2}) = fl(\sqrt{0}) = 0$. Finally, the computed value $fl(x/\sqrt{x^2 + y^2}) =
fl(10^{-200}/0)$ which evaluates to Inf.
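
The diagnosis can be replayed step by step in Python, which also uses IEEE double precision. One difference from MATLAB is worth flagging as an assumption of this sketch: Python raises `ZeroDivisionError` for floating point division by zero instead of returning Inf, so the underflow case is shown only up to the zero denominator. The function name `naive_cos` is hypothetical.

```python
import math

def naive_cos(x: float, y: float) -> float:
    """Direct use of (1.4.1); fails for very large or very small inputs."""
    return x / math.sqrt(x * x + y * y)

# Overflow: x*x = 1e400 exceeds the largest double, so it becomes inf.
x = y = 1e200
print(x * x)                       # inf
print(math.sqrt(x * x + y * y))    # inf
print(naive_cos(x, y))             # 0.0 instead of 1/sqrt(2)

# Underflow: x*x = 1e-400 is below the smallest positive double, so it becomes 0.
x = y = 1e-200
print(x * x)                       # 0.0
print(math.sqrt(x * x + y * y))    # 0.0; dividing by this is the source of Inf in MATLAB

# Moderate sizes behave well.
print(naive_cos(1e100, 1e100))     # close to 0.7071067811865476
```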


1.4.1.2 Finding a Remedy 


Now that we have diagnosed the problem, how do we find a way to do the computation 
without overflow or underflow? 

The essential issue here is that if $x$ is sufficiently large but finite, $x^2$ can overflow.
Or if $x$ is sufficiently small but non-zero, then $x^2$ can underflow to zero. The values
of $x/\sqrt{x^2 + y^2}$ should be invariant under scaling. That is, if $s > 0$ then

\[ \frac{x}{\sqrt{x^2 + y^2}} = \frac{(x/s)}{\sqrt{(x/s)^2 + (y/s)^2}}. \]

By choosing $s$ appropriately we can prevent overflow from happening. We focus on
avoiding overflow first, as overflow is more likely to be fatal to the computation. We
still need to avoid the denominator underflowing as well. We could scale $(x, y)$ by
$|x|$:

\[ \frac{(x/|x|)}{\sqrt{(x/|x|)^2 + (y/|x|)^2}} = \frac{\operatorname{sign} x}{\sqrt{1 + (y/x)^2}}. \]

This will avoid overflow if $|y/x|$ is not large. So this appears to be appropriate if
$|y| \le |x|$. If $|y| \ge |x|$, then we should be scaling by $|y|$:

\[ \frac{(x/|y|)}{\sqrt{(x/|y|)^2 + (y/|y|)^2}} = \frac{(x/|y|)}{\sqrt{(x/y)^2 + 1}}. \]

With these formulas, it is still possible that $(y/x)^2$ underflows if $|y| \ll |x|$, or that
$(x/y)^2$ underflows if $|y| \gg |x|$. However, this does not cause problems: for $|y| \ll |x|$,
if $(y/x)^2$ underflows, then $|y/x| \le \sqrt{f_{\min}} \ll u$ (at least for the IEEE standards). The
computed value of the denominator is $fl(\sqrt{1 + (y/x)^2}) = 1$, so that the result is $\operatorname{sign} x$,
which is the correctly rounded exact value. For $|y| \gg |x|$, if $(x/y)^2$ underflows, then
$|x/y| \le \sqrt{f_{\min}} \ll u$, so $fl(\sqrt{(x/y)^2 + 1}) = 1$ and the computed result is $fl(x/|y|)$.
The absolute error in this value is no more than $u$, but it is even better than that.
If $fl(x/|y|)$ itself does not underflow, the relative error is no more than $u$: $x/y =
\cot\theta = \cos\theta/\sin\theta$ where $\sin\theta = \operatorname{sign} y/\sqrt{1 + (x/y)^2}$ differs from $\operatorname{sign} y$ by much
less than $u$. So $\cos\theta = (x/y)\sin\theta \approx (x/y)\operatorname{sign} y$ has a relative error of much less
than $u$, and therefore $fl(x/|y|)$ has a relative error of no more than $2u$.
Algorithm 5 Computing $\cos\theta$ and $\sin\theta$ from $(x, y)$


function cos_sin_xy(x, y)
    if x = 0 & y = 0: return fail
    if |x| ≥ |y|
        r ← y/|x|;  d ← 1/√(1 + r²)
        c ← (sign x) d;  s ← (y/|x|) d
    else
        r ← x/|y|;  d ← 1/√(1 + r²)
        c ← (x/|y|) d;  s ← (sign y) d
    end if
    return (c, s)
end function


Code for implementing the algorithm for computing both $\cos\theta$ and $\sin\theta$ given
$(x, y)$ is shown in Algorithm 5. Note that the only case where the algorithm fails is
if $(x, y) = (0, 0)$, where the computation is impossible.

This kind of computation is common enough that there is a function hypot(x, y)
that is available in most libraries for computing $\sqrt{x^2 + y^2}$ without the overflow or
underflow caused by the squaring.
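
One possible Python rendering of Algorithm 5 is sketched below. Two choices here are assumptions rather than part of the algorithm as stated: failure is signalled by raising `ValueError`, and `math.copysign` stands in for multiplication by the sign.

```python
import math

def cos_sin_xy(x: float, y: float):
    """Return (cos(theta), sin(theta)) for the point (x, y), scaling by the
    larger of |x| and |y| so that no intermediate square can overflow."""
    if x == 0 and y == 0:
        raise ValueError("the angle is undefined at the origin")
    if abs(x) >= abs(y):
        r = y / abs(x)                        # |r| <= 1
        d = 1.0 / math.sqrt(1.0 + r * r)
        return math.copysign(d, x), r * d     # ((sign x) d, (y/|x|) d)
    else:
        r = x / abs(y)                        # |r| < 1
        d = 1.0 / math.sqrt(1.0 + r * r)
        return r * d, math.copysign(d, y)     # ((x/|y|) d, (sign y) d)

print(cos_sin_xy(1e200, 1e200))     # both close to 0.7071067811865476
print(cos_sin_xy(1e-200, 1e-200))   # the same, with no underflow trouble
print(cos_sin_xy(-3.0, 4.0))        # close to (-0.6, 0.8)
print(1e200 / math.hypot(1e200, 1e200))  # hypot avoids the overflow as well
```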


1.4.2 Subtracting Nearly Equal Quantities 


1.4.2.1 Computing $(1 - \cos x)/x^2$ for $x \approx 0$


Consider the expression $(1 - \cos x)/x^2$. We can use l’Hospital’s rule to show that
$\lim_{x \to 0} (1 - \cos x)/x^2 = 1/2$. When we compute values numerically (in MATLAB),
however, we get the results in Table 1.4.1.

Clearly something goes very wrong for $x \le 10^{-8}$. But even for $x = 10^{-5}$, $10^{-6}$,
and $10^{-7}$, there is something unusual happening. Using the Taylor series for $\cos x =
1 - \tfrac{1}{2}x^2 + \tfrac{1}{24}x^4 - \tfrac{1}{720}x^6 + \cdots$ we get

\[ \frac{1 - \cos x}{x^2} = \frac{1}{2} - \frac{1}{24}x^2 + \frac{1}{720}x^4 - \cdots \approx \frac{1}{2} - \frac{1}{24}x^2. \]

So the value computed for $(1 - \cos x)/x^2$ should be a little below $\tfrac{1}{2}$. Instead, for
$x = 10^{-5}$ and $10^{-6}$ the computed value is a little above $\tfrac{1}{2}$, while for $x = 10^{-7}$ the
computed value is again below $\tfrac{1}{2}$, but the distance from $\tfrac{1}{2}$ has only increased for
these values of $x$. Why?

Table 1.4.1 Computed values of $(1 - \cos x)/x^2$

x           (1 − cos x)/x²         x           (1 − cos x)/x²
$10^{-1}$   0.499583472197429      $10^{-6}$   0.500044450291171
$10^{-2}$   0.499995833347366      $10^{-7}$   0.499600361081320
$10^{-3}$   0.499999958325503      $10^{-8}$   0
$10^{-4}$   0.499999996961265      $10^{-9}$   0
$10^{-5}$   0.500000041370186      $10^{-10}$  0

We want to understand not just the catastrophic loss of accuracy for $x \le 10^{-8}$, but also the gradual loss of accuracy for $x$ in the range $10^{-5}$ to $10^{-7}$. Assuming no underflow or overflow we will use the formal model (1.3.1), noting that the $\epsilon$ is different for each computation: assuming that $x$ is a normalized floating point number,

$$\begin{aligned}
\mathrm{fl}\!\left(\frac{1 - \cos x}{x^2}\right) &= \frac{\mathrm{fl}(1 - \cos x)}{\mathrm{fl}(x^2)}\,(1 + \epsilon_1) \\
&= \frac{(1 - \mathrm{fl}(\cos x))(1 + \epsilon_2)}{x^2 (1 + \epsilon_3)}\,(1 + \epsilon_1) \\
&= \frac{1 - \mathrm{fl}(\cos x)}{x^2}\,\frac{1 + \epsilon_2}{1 + \epsilon_3}\,(1 + \epsilon_1),
\end{aligned}$$


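where $|\epsilon_1|, |\epsilon_2|, |\epsilon_3| \le u$. Applying the function model to $\cos x$ we have $\mathrm{fl}(\cos x) = \cos((1 + \epsilon_5)x)\,(1 + \epsilon_6)$. For $x \approx 0$, $\cos x \approx 1 - \frac12 x^2$, so to first order in the $\epsilon$'s, $\mathrm{fl}(\cos x) \approx \cos x + \epsilon_6 - x^2(\epsilon_5 + \frac12\epsilon_6)$. Thus, for $x \approx 0$,

$$\begin{aligned}
\mathrm{fl}\!\left(\frac{1 - \cos x}{x^2}\right) &\approx \frac{1 - \cos x - \epsilon_6 + x^2(\epsilon_5 + \frac12\epsilon_6)}{x^2}\,\frac{1 + \epsilon_2}{1 + \epsilon_3}\,(1 + \epsilon_1) \\
&= \left[\frac{1 - \cos x}{x^2} - \frac{\epsilon_6}{x^2} + \epsilon_5 + \frac12\epsilon_6\right]\frac{1 + \epsilon_2}{1 + \epsilon_3}\,(1 + \epsilon_1) \\
&\approx \left[\frac12 - \frac{\epsilon_6}{x^2} + \epsilon_5 + \frac12\epsilon_6\right]\frac{1 + \epsilon_2}{1 + \epsilon_3}\,(1 + \epsilon_1).
\end{aligned}$$

The expression $\dfrac{1 + \epsilon_2}{1 + \epsilon_3}\,(1 + \epsilon_1)$ differs from 1 by $\le 3u + O(u^2)$.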


So the main part of the error for $x$ small is $\epsilon_6/x^2$, which is bounded by $u/x^2$. If $x = 10^{-8}$ then for double precision, $u/x^2 \approx 2.2 \times 10^{-16}/(10^{-8})^2 = 2.2$, which is larger than $\frac12$, so we do not expect any accuracy for $x = 10^{-8}$ using double precision. This is what Table 1.4.1 shows. This also allows us to get estimates of the size of the roundoff error in $\mathrm{fl}((1 - \cos x)/x^2)$: for $x = 10^{-6}$, $u/x^2 \approx 2.2 \times 10^{-4}$ while the actual roundoff error for $x = 10^{-6}$ is $\approx 4.4 \times 10^{-5}$. The difference between $\mathrm{fl}((1 - \cos x)/x^2)$ and $\frac12$ can be broken down into the roundoff error in computing $(1 - \cos x)/x^2$ and the difference $(1 - \cos x)/x^2 - \frac12$ (called the truncation error). These errors are shown in Figure 1.4.2.

[Figure: log-log plot of the rounding errors and the truncation errors against $x$.]

Fig. 1.4.2 Roundoff and truncation errors for $(1 - \cos x)/x^2 \to \frac12$ as $x \to 0$


1.4.2.2 Subtracting Nearly Equal Quantities 


The fundamental issue is that in computing $1 - \cos x$ for $x \approx 0$ we are subtracting nearly equal quantities. This results in potentially large relative errors. A simple decimal example illustrates this idea:


3.285253249 
—3.285252015 
0.000001234 


The result has about four digits of accuracy while the quantities subtracted each have 10 digits of accuracy. The reason for the loss of the digits of accuracy is the subtraction of nearly equal quantities. This can be seen in the relative error of a difference of two expressions $e_1 - e_2$:

$$\frac{|e_1 - e_2 - \mathrm{fl}(e_1 - e_2)|}{|e_1 - e_2|}.$$

Now $\mathrm{fl}(e_1 - e_2) = (\mathrm{fl}(e_1) - \mathrm{fl}(e_2))(1 + \epsilon_1)$ with $|\epsilon_1| \le u$. If $e_1$ and $e_2$ are computed accurately, then $\mathrm{fl}(e_1) = e_1(1 + \eta_1)$ with $|\eta_1| = O(u)$ and $\mathrm{fl}(e_2) = e_2(1 + \eta_2)$ with $|\eta_2| = O(u)$. Then

$$\begin{aligned}
|e_1 - e_2 - \mathrm{fl}(e_1 - e_2)| &= |e_1 - e_2 - (e_1(1 + \eta_1) - e_2(1 + \eta_2))(1 + \epsilon_1)| \\
&\le |e_1\eta_1 - e_2\eta_2| + |\epsilon_1|\,|e_1(1 + \eta_1) - e_2(1 + \eta_2)| \\
&\le |e_1\eta_1 - e_2\eta_2| + |\epsilon_1|\,\bigl[|e_1 - e_2| + |e_1\eta_1 - e_2\eta_2|\bigr] \\
&= (1 + |\epsilon_1|)\,|e_1\eta_1 - e_2\eta_2| + |\epsilon_1|\,|e_1 - e_2|.
\end{aligned}$$

So

$$\begin{aligned}
\frac{|e_1 - e_2 - \mathrm{fl}(e_1 - e_2)|}{|e_1 - e_2|} &\le (1 + |\epsilon_1|)\,\frac{|e_1\eta_1 - e_2\eta_2|}{|e_1 - e_2|} + |\epsilon_1| \\
&\le (1 + u)\,\frac{|e_1| + |e_2|}{|e_1 - e_2|}\,\max(|\eta_1|, |\eta_2|) + u \\
&= \frac{|e_1| + |e_2|}{|e_1 - e_2|}\,O(u).
\end{aligned}$$

If $(|e_1| + |e_2|)/|e_1 - e_2|$ is large, then there can be a large reduction in the relative accuracy, due to the subtraction of nearly equal quantities. Indeed, if $(|e_1| + |e_2|)/|e_1 - e_2|$ is of the order of $1/u$, then the resulting error can be 100% or more of the true value $e_1 - e_2$, in which case the loss of precision is regarded as catastrophic: there are no correct digits in the result.

It is therefore desirable to avoid subtracting nearly equal quantities. In some cases, 
this cannot be avoided. But we try to avoid it where we can. 
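As a rough check of the bound above, the amplification factor $(|e_1| + |e_2|)/|e_1 - e_2|$ can be evaluated for the decimal example. The helper name `cancellation_factor` is an assumption of this sketch, and reading $\log_{10}$ of the factor as "digits lost" is only a rule of thumb.

```python
import math

def cancellation_factor(e1, e2):
    """Amplification factor (|e1| + |e2|) / |e1 - e2| from the relative error bound."""
    return (abs(e1) + abs(e2)) / abs(e1 - e2)

factor = cancellation_factor(3.285253249, 3.285252015)
print(f"amplification factor: {factor:.3e}")
print(f"approximate decimal digits lost: {math.log10(factor):.1f}")
```

A factor of several million corresponds to losing between six and seven of the ten decimal digits, in line with the roughly four digits that survive in the example.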


1.4.2.3 Remedying the Loss of Accuracy 


If the problem of computing $(1 - \cos x)/x^2$ for $x \approx 0$ is subtracting $\cos x$ from one, then we should reformulate the expression to avoid this subtraction. If we can perform the subtraction symbolically instead of numerically, we can improve the accuracy. For example,

$$\frac{1 - \cos x}{x^2} = \frac{1 - \cos x}{x^2}\,\frac{1 + \cos x}{1 + \cos x} = \frac{1 - \cos^2 x}{x^2(1 + \cos x)}.$$

Now using $1 - \cos^2 x = \sin^2 x$ we get

$$\frac{1 - \cos x}{x^2} = \frac{\sin^2 x}{x^2(1 + \cos x)} = \left(\frac{\sin x}{x}\right)^2 \frac{1}{1 + \cos x}.$$

No longer do we have a subtraction of nearly equal quantities for $x \approx 0$. This new formula probably would not work well for $x \approx \pm\pi$, as there $1 + \cos x \approx 0$. But for $|x| \le \pi/2$, for example, this new formula should work well.
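The two formulas can be compared directly in double precision. The sketch below is illustrative only: the function names `naive` and `stable` are assumptions, and the special case returning the limit $\frac12$ at $x = 0$ is an addition that the text does not discuss.

```python
import math

def naive(x):
    """(1 - cos x)/x^2 evaluated as written; suffers cancellation for small x."""
    return (1.0 - math.cos(x)) / (x * x)

def stable(x):
    """(sin x / x)^2 / (1 + cos x); no subtraction of nearly equal quantities near 0."""
    if x == 0.0:
        return 0.5                      # limiting value, added for this sketch
    s = math.sin(x) / x
    return s * s / (1.0 + math.cos(x))

for k in range(1, 11):
    x = 10.0 ** (-k)
    print(f"x = {x:7.0e}   naive = {naive(x):.15f}   stable = {stable(x):.15f}")
```

The reformulated version stays close to $\frac12 - \frac{1}{24}x^2$ over the whole range, while the direct version degrades as $x$ shrinks.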

There are many other examples of how we can use similar techniques to give new expressions that are equivalent to the original expression in exact arithmetic, but avoid subtraction of nearly equal quantities and give greater accuracy. For example, for $x \approx 0$

$$\begin{aligned}
\frac{\sqrt{1 + x} - (1 + \frac12 x)}{x^2} &= \frac{\sqrt{1 + x} - (1 + \frac12 x)}{x^2}\;\frac{\sqrt{1 + x} + (1 + \frac12 x)}{\sqrt{1 + x} + (1 + \frac12 x)} \\
&= \frac{(1 + x) - (1 + \frac12 x)^2}{x^2\,[\sqrt{1 + x} + (1 + \frac12 x)]} = \frac{-\frac14 x^2}{x^2\,[\sqrt{1 + x} + (1 + \frac12 x)]} \\
&= \frac{-1}{4\,[\sqrt{1 + x} + (1 + \frac12 x)]}.
\end{aligned}$$


Some special cases have support from built-in functions, such as $\mathtt{log1p}(x) = \ln(1 + x)$ and $\mathtt{expm1}(x) = e^x - 1$ for $x \approx 0$. If we use log(1 + x) for computing $\ln(1 + x)$, the subtraction that occurs is implicit, as for the input $z = 1 + x$,

$$\ln(z) = \ln(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots = (z - 1) - \frac{(z - 1)^2}{2} + \frac{(z - 1)^3}{3} - \cdots.$$

As a result, internally to the call log(1 + x), the log function has to compute $\mathrm{fl}(\mathrm{fl}(1 + x) - 1)$, which can result in poor relative accuracy if $x$ is small. Instead, log1p() provides a way of computing $\log(1 + x)$ accurately for small $x$.

On the other hand, expm1() represents an explicit subtraction which is a sub- 
traction of nearly equal quantities if x is small. Both log1p and expm1 are available 
in the standard libraries of most languages. 
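Python's standard `math` module provides both functions, so the effect is easy to observe. In the sketch below, the wrapper names `naive_log1p`, `naive_expm1`, and `rel_err` are assumptions, and the reference values come from the first two terms of the respective Taylor series, which is more than enough at $x = 10^{-10}$.

```python
import math

def naive_log1p(x):
    return math.log(1.0 + x)       # 1 + x is rounded before the logarithm sees it

def naive_expm1(x):
    return math.exp(x) - 1.0       # explicit subtraction of nearly equal quantities

def rel_err(approx, exact):
    return abs(approx - exact) / abs(exact)

x = 1e-10
log_ref = x - x * x / 2.0          # ln(1 + x) = x - x^2/2 + O(x^3)
exp_ref = x + x * x / 2.0          # e^x - 1  = x + x^2/2 + O(x^3)

print("log(1 + x):", rel_err(naive_log1p(x), log_ref))
print("log1p(x):  ", rel_err(math.log1p(x), log_ref))
print("exp(x) - 1:", rel_err(naive_expm1(x), exp_ref))
print("expm1(x):  ", rel_err(math.expm1(x), exp_ref))
```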


1.4.3 Numerical Instability


Loss of accuracy does not necessarily happen all at once. Sometimes it can happen through a process that amplifies errors at each stage until the error overwhelms the computation.


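Consider the integral $I_n := \int_0^1 x^n e^{-x}\,dx$. This is easy to compute analytically for $n = 0$: $\int_0^1 e^{-x}\,dx = 1 - e^{-1}$. We can create a recursive algorithm for computing these values:

$$\begin{aligned}
I_{n+1} &= \int_0^1 x^{n+1} e^{-x}\,dx = -\int_0^1 x^{n+1}\,\frac{d}{dx}(e^{-x})\,dx \\
&= -x^{n+1} e^{-x}\Big|_{x=0}^{x=1} + \int_0^1 \frac{d}{dx}(x^{n+1})\,e^{-x}\,dx \\
&= -e^{-1} + (n + 1)\,I_n.
\end{aligned} \tag{1.4.3}$$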
With this recurrence we can compute $I_1, I_2, \ldots$. However, problems become apparent using this scheme by the time we get to $I_{20}$, as can be seen in Table 1.4.2, which shows the computed values for double precision.




The results are evidently wrong for $n \ge 17$, as $I_n > 0$ for all $n$. Furthermore, $I_n \sim 1/(e(n + 1))$ as $n \to \infty$. The error grows before it becomes evident: with $\hat{I}_n$ denoting the computed value, $I_5 - \hat{I}_5 \approx 4.1 \times 10^{-15}$, $I_{10} - \hat{I}_{10} \approx 1.2 \times 10^{-10}$, and $I_{15} - \hat{I}_{15} \approx 4.4 \times 10^{-5}$. To see why, consider

$$\begin{aligned}
\hat{I}_{n+1} &= \mathrm{fl}((n + 1)\hat{I}_n - e^{-1}) \\
&= (n + 1)\hat{I}_n(1 + \epsilon_{n,1}) - e^{-1}(1 + \epsilon_{n,2}), \quad\text{while} \\
I_{n+1} &= (n + 1) I_n - e^{-1}.
\end{aligned}$$

So

$$\hat{I}_{n+1} - I_{n+1} = (n + 1)(\hat{I}_n - I_n) + \left[\epsilon_{n,1}(n + 1)\hat{I}_n - \epsilon_{n,2}\,e^{-1}\right].$$

Assuming the quantity in $[\cdots]$ is $O(u)$, which is reasonable before $\hat{I}_n$ "blows up", the error in the results grows according to

$$\hat{I}_{n+1} - I_{n+1} = (n + 1)(\hat{I}_n - I_n) + O(u), \quad\text{so}\quad \hat{I}_n - I_n = O(n!\,u).$$

Because $n!$ grows so fast with $n$, it does not take an enormous number of steps before the error in $\hat{I}_n$ is larger than $I_n$.

The iteration $I_{n+1} = (n + 1) I_n - e^{-1}$ is an unstable iteration, as small errors are amplified with each step. We can turn this around to our advantage, by reversing the direction of the iteration:

$$I_n = \frac{1}{n + 1}\left[I_{n+1} + e^{-1}\right]. \tag{1.4.4}$$

The recurrence (1.4.4) is actually very stable. It is so stable, in fact, that we can start with $n = n_0$ for some $n_0$ such as $n_0 = 40$ with a rough approximation for $I_{n_0}$ and work backward to obtain more accurate approximations than we get from the forward recurrence. Errors in the computed values of $I_n$ for selected values of $n$ are shown in Table 1.4.3.

The results in Table 1.4.3 indicate full double precision accuracy for these results. 
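The contrast between the two directions of the recurrence can be reproduced with a short Python sketch. The function names, the default of 20 extra backward steps, and the starting guess of zero are assumptions of this illustration; the text suggests a fixed start such as $n_0 = 40$ with some rough approximation.

```python
import math

def forward(n):
    """I_n by the forward recurrence (1.4.3), starting from I_0 = 1 - 1/e."""
    inv_e = math.exp(-1.0)
    value = 1.0 - inv_e
    for k in range(n):
        value = (k + 1) * value - inv_e      # I_{k+1} = (k+1) I_k - e^{-1}
    return value

def backward(n, extra=20):
    """I_n by the reverse recurrence (1.4.4), started at n0 = n + extra."""
    inv_e = math.exp(-1.0)
    n0 = n + extra
    value = 0.0                              # rough starting guess for I_{n0}
    for k in range(n0 - 1, n - 1, -1):
        value = (value + inv_e) / (k + 1)    # I_k = (I_{k+1} + e^{-1}) / (k+1)
    return value

for n in (0, 5, 10, 15, 20, 25):
    print(f"n = {n:2d}   forward = {forward(n): .6e}   backward = {backward(n): .16f}")
```

Each backward step divides the inherited error by $k + 1$, so the error in the starting guess is reduced by a product of such factors by the time step $n$ is reached.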


1.4.4 Adding Many Numbers 


The apparently trivial problem of adding many numbers can reveal surprising depth 
in numerical computation. If we aim for maximal accuracy, such as implementing 
a special mathematical function for many people to use, or doing some other high- 
precision computation, we might want to find how to add numbers with the least 
error.

Table 1.4.2 Computed values of $I_n = \int_0^1 x^n e^{-x}\,dx$

n    computed I_n            n    computed I_n            n    computed I_n
0    0.6321205588285577      8    0.04043407756729511     16   0.009553035697593693
1    0.2642411176571153      9    0.03646133450150879     17   −0.19592479861475587
2    0.1606027941427883      10   0.03319523834515437     18   −4.090450614851804 × 10^0
3    0.1139289412569227      11   0.03046341897041005     19   −8.217689173820752 × 10^1
4    0.0878363238562483      12   0.02814500544388832     20   −1.726082605943529 × 10^3
5    0.0713021781097991      13   0.02615063504299409     21   −3.797418521019881 × 10^4
6    0.0599336274873523      14   0.02438008447346896     22   −8.734066277140138 × 10^5
7    0.0516559512400239      15   0.02220191040406094     23   −2.096175943301578 × 10^7


Table 1.4.3 Errors in computed values of $I_n$ using reverse recurrence (1.4.4)

n     $\hat{I}_n - I_n$
0     −1.24 × 10^{-17}
5     +5.40 × 10^{-19}
10    −1.55 × 10^{-18}
15    +1.48 × 10^{-18}
20    +8.57 × 10^{-20}


Algorithm 6 Naive algorithm for adding numbers in an array

1 function sum(a)
2     s ← 0
3     for i = 1, 2, ..., length(a)
4         s ← s + a_i
5     end for
6     return s
7 end function


The standard, or naive, algorithm for adding an array of numbers is shown in 
Algorithm 6. 

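For the error analysis of Algorithm 6, we use the notation $\hat{s}_i$ for the computed value of $s$ after adding $a_i$ on line 4. Let $\hat{s}_0 = 0$. Then $\hat{s}_i = \mathrm{fl}(\hat{s}_{i-1} + a_i) = (\hat{s}_{i-1} + a_i)(1 + \epsilon_i)$ with $|\epsilon_i| \le u$. If $n = \mathrm{length}(a)$, the returned value is

$$\hat{s}_n = \sum_{i=1}^{n} a_i \prod_{j=i}^{n} (1 + \epsilon_j),$$

which leads to the bound

$$\left|\hat{s}_n - \sum_{i=1}^{n} a_i\right| \le \sum_{i=1}^{n} |a_i|\,(n - i + 1)\,u + O((nu)^2).$$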


This means that for maximum accuracy numbers should be added from smallest to largest.

But our analysis of the naive Algorithm 6 indicates that the more additions applied to a sum with a given term $a_i$, the larger the bound on the roundoff error. We can think of this in terms of the depth of a term in the sum:

$$((\cdots(((a_1 + a_2) + a_3) + a_4) + \cdots + a_{n-2}) + a_{n-1}) + a_n.$$

If we reduce the depth of the terms, we have the possibility of reducing the error in the sum. One way of reducing the maximum depth is to split sums in the middle: $\sum_{k=i}^{j} a_k = \sum_{k=i}^{m} a_k + \sum_{k=m+1}^{j} a_k$ with $m = \lfloor (i + j)/2 \rfloor$, for example. The maximum depth is then $\lceil \log_2 n \rceil$ and we can get bounds on the rounding error of $\approx (\log_2 n)\,u \max_i |a_i|$. Pseudo-code for this can be found in Algorithm 2.
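A minimal Python sketch of splitting sums in the middle is given below, alongside the naive loop of Algorithm 6. It is not a transcription of Algorithm 2: the function names and the half-open index convention `a[lo:hi]` are assumptions here, and `math.fsum` is used only as an accurate reference value.

```python
import math

def naive_sum(a):
    """Left-to-right summation, as in Algorithm 6."""
    s = 0.0
    for value in a:
        s += value
    return s

def pairwise_sum(a, lo=0, hi=None):
    """Sum a[lo:hi] by splitting the index range in the middle."""
    if hi is None:
        hi = len(a)
    n = hi - lo
    if n == 0:
        return 0.0
    if n == 1:
        return a[lo]
    mid = lo + n // 2
    return pairwise_sum(a, lo, mid) + pairwise_sum(a, mid, hi)

a = [0.1] * 100_000
reference = math.fsum(a)
print("naive error:   ", abs(naive_sum(a) - reference))
print("pairwise error:", abs(pairwise_sum(a) - reference))
```

The recursion depth is about $\log_2 n$, so each term takes part in far fewer additions than in the left-to-right loop.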

Finally, the pseudo-random character of roundoff error should remind us that 
sometimes a statistical analysis can be beneficial for understanding the behavior of 
roundoff error for long sums of terms of similar magnitude, as can occur solving 
ordinary differential equations. 


Exercises. 


(1) Carry out the following computations: initially set $x$ to 2; then repeat $x \leftarrow \sqrt{x}$ 20 times, and then repeat $x \leftarrow x^2$ 20 times. If everything was done in exact arithmetic you would get exactly 2 for the final value of $x$. What do you get? Can you explain why? [Hint: Note that $(a + \epsilon)^2 = a^2 + 2a\epsilon + \epsilon^2$. If $a$ is the exact value, ignoring the $\epsilon^2$ term, the error $\epsilon$ is amplified by a factor of $2a$.]

(2) The most common statistics computed from a data set $x_1, x_2, \ldots, x_N$ are the mean $\bar{x} = (1/N)\sum_{i=1}^{N} x_i$ and variance $s^2 = (N - 1)^{-1}\sum_{i=1}^{N}(x_i - \bar{x})^2$. There is a second, equivalent, formula for the variance: $s^2 = (N - 1)^{-1}\left[\left(\sum_{i=1}^{N} x_i^2\right) - N\bar{x}^2\right]$. However, they are not equivalent numerically. Create a set of N = 100
values x; = 10° + v; where v; are randomly generated numbers uniformly over 
the interval [—1, +1]. Compute the variance s* by these two formulas. Which 
is more accurate? Explain why. 
(3) Compute the values of $((1 + x)^{1/3} - 1)/x$ for $x = 10^{-k}$ for $k = 1, 2, \ldots, 15$. Compute the limit of the expression as $x \to 0$ using l'Hospital's rule. Re-write the formula $((1 + x)^{1/3} - 1)/x$ to give an equivalent expression (assuming exact arithmetic) that is more accurate for small $x$.
(4) The function $f(x) = e^x - 1$, if implemented directly, is not accurate for $x \approx 0$ because of the subtraction of nearly equal quantities. Show that evaluating the expression in floating point

e*— 1] 


ini 


has a relative error of no more than unit roundoff $u$ for $0 < x < \sqrt{u}$.

(5) The hyperbolic tangent function $\tanh(x) = (e^{+x} - e^{-x})/(e^{+x} + e^{-x})$ implemented like this gives NaN if $x$ is large, for example, if $x = 1000$ when using IEEE double precision. Write a new implementation that gives accurate values
but avoids generating NaNs.
(6) The expression $(\tan x - \sin x)/x^3$ gives inaccurate values for $x \approx 0$ due to subtraction of nearly equal quantities. Re-write this in an equivalent way (in exact arithmetic) that will give accurate values for $x \approx 0$.

(7) The function $g(u) = \log(1 + e^u)$ is a smooth approximation to $\max(0, u)$. However, it suffers from overflow if $u$ is positive and large enough. Implement this
function so that overflow does not occur for any reasonable value of u. Reason- 
able values must include u = +10°. 

(8) The update formula (1.4.3) for computing $\int_0^1 x^n e^{-x}\,dx$ amplifies errors, while the reversed iteration (1.4.4) $I_n = \frac{1}{n+1}\left[I_{n+1} + e^{-1}\right]$ reduces the error. Implement a method to compute $I_n$ for any given $n$ using the reversed iteration and a starting value $I_m \approx e^{-1}/m$ for some $m > n$. To test your method, compare the computed value $\hat{I}_n$ with $\int_0^1 x^n e^{-x}\,dx = \sum_{k=0}^{\infty} (-1)^k/((n + k + 1)\,k!)$.

(9) If $a \approx b$ are large positive numbers, which gives the smaller roundoff error in general: $a^2 - b^2$ or $(a - b)(a + b)$? Give bounds on the roundoff error for these two expressions based on the formal model (1.3.1) assuming no overflow or underflow. Which expression is more likely to produce overflow?

(10) Write a recursive routine to compute the cardinality $n(S) = |S|$, the mean $\mu(S) = n(S)^{-1}\sum_{i \in S} x_i$, and sum $ss(S) = \sum_{i \in S}(x_i - \mu(S))^2$ for a set of indexes $S = \{\, i \mid k \le i < \ell \,\}$. The core of the routine is how it computes $(n(S), \mu(S), ss(S))$ from $(n(S_1), \mu(S_1), ss(S_1))$ and $(n(S_2), \mu(S_2), ss(S_2))$ where $S = S_1 \cup S_2$ and $S_1 \cap S_2 = \emptyset$. Do this in a way to avoid subtraction of large, nearly equal, quantities.


1.5 Measuring: Norms 


We need to measure the size of an error, whether the error is a scalar, a vector, or 
a matrix. So we need to measure the “size” of a vector or matrix. There are many 
ways of doing this. Some measures of size are more suitable to certain applications 
(such as geometry or the worst-case error). But all measures have basic properties 
that must hold in order to use them to prove error bounds. 


1.5.1 What Is a Norm?


A norm is a real-valued function $\|\cdot\|$ on a vector space that we use to measure the size of vectors in that space. Norms have the following properties:


• $\|x\| \ge 0$ for all vectors $x$ and $\|x\| = 0$ implies $x = 0$;
• $\|sx\| = |s|\,\|x\|$ for any vector $x$ and scalar $s$;
• $\|x + y\| \le \|x\| + \|y\|$ for vectors $x$ and $y$.


The last inequality is called the triangle inequality. Norms of matrices have the same properties:

• $\|A\| \ge 0$ for all matrices $A$ and $\|A\| = 0$ implies $A = 0$;
• $\|sA\| = |s|\,\|A\|$ for all matrices $A$ and scalars $s$;
• $\|A + B\| \le \|A\| + \|B\|$ for all matrices $A$ and $B$,


along with these additional properties provided the matrix and vector norms are compatible:


• $\|Ax\| \le \|A\|\,\|x\|$;
• $\|AB\| \le \|A\|\,\|B\|$.


Example 1.1 Norm examples. 


Vector norms in common use include the following:


• $\|x\|_2 = \left(\sum_i |x_i|^2\right)^{1/2} = \sqrt{x^T x}$ for real $x$;
• $\|x\|_\infty = \max_i |x_i|$;
• $\|x\|_1 = \sum_i |x_i|$.


These norms are part of a family of norms: $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$ for $p \ge 1$. Note that $\|x\|_\infty = \lim_{p \to \infty} \|x\|_p$.
Matrix norms can be induced by vector norms:

$$\|A\| = \sup_{x \ne 0} \frac{\|Ax\|}{\|x\|}. \tag{1.5.1}$$

Note that the norms on the right are both vector norms.
Induced matrix norms are compatible with the vector norms used to define them.
Formulas are known for the induced matrix norms of the 1-, 2-, and ∞-norms:


• $\|A\|_2 = \sqrt{\lambda_{\max}(A^T A)}$ where $\lambda_{\max}(B)$ is the maximum eigenvalue of $B$;
• $\|A\|_\infty = \max_i \sum_j |a_{ij}|$;
• $\|A\|_1 = \max_j \sum_i |a_{ij}|$.


New norms can be created from old norms. For example, if $D$ is an invertible matrix, then we can define the scaled norms:

$$\|x\|_{D,p} = \|Dx\|_p.$$

The corresponding induced matrix norm is

$$\|A\|_{D,p} = \sup_{x \ne 0} \frac{\|Ax\|_{D,p}}{\|x\|_{D,p}} = \sup_{x \ne 0} \frac{\|DAx\|_p}{\|Dx\|_p} = \sup_{z \ne 0} \frac{\|DAD^{-1}z\|_p}{\|z\|_p} \quad (z = Dx) \quad = \|DAD^{-1}\|_p.$$

A special matrix norm that is easy to compute is the Frobenius norm:

$$\|A\|_F = \Bigl(\sum_{i,j} |a_{ij}|^2\Bigr)^{1/2}.$$

The Frobenius norm is not an induced norm, but it is compatible with the vector 2-norm:

$$\|Ax\|_2 \le \|A\|_F\,\|x\|_2.$$

A fact that is important for theory is that any two norms $\|\cdot\|_{(1)}$ and $\|\cdot\|_{(2)}$ on a finite-dimensional vector space are equivalent in the sense that there is a constant $c > 0$ where

$$\frac{1}{c}\,\|x\|_{(1)} \le \|x\|_{(2)} \le c\,\|x\|_{(1)} \quad\text{for all } x. \tag{1.5.2}$$
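The norms that have explicit formulas are straightforward to compute. The following pure-Python sketch (all function names are assumptions, and matrices are represented as lists of rows) evaluates the vector $p$-norms, the induced 1- and ∞-norms, and the Frobenius norm, and checks two of the compatibility inequalities on a small example.

```python
import math

def vec_norm(x, p):
    """Vector p-norm for p >= 1; use p = math.inf for the max norm."""
    if p == math.inf:
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def mat_norm_1(A):
    """Induced 1-norm: maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))

def mat_norm_inf(A):
    """Induced infinity-norm: maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in A)

def frobenius(A):
    return math.sqrt(sum(v * v for row in A for v in row))

def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

A = [[1.0, -2.0], [3.0, 4.0]]
x = [1.0, -1.0]
Ax = matvec(A, x)
print(vec_norm(Ax, math.inf), "<=", mat_norm_inf(A) * vec_norm(x, math.inf))
print(vec_norm(Ax, 2), "<=", frobenius(A) * vec_norm(x, 2))
```

The induced 2-norm is omitted because it needs an eigenvalue computation, which is outside the scope of this sketch.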


1.5.2 Norms of Functions


How big is a function? It depends on the norm you use to measure it. The usual norm properties must hold: for any functions $f, g\colon D \to \mathbb{R}$


• $\|f\| \ge 0$, and $\|f\| = 0$ implies $f(x) = 0$ for all $x \in D$;
• $\|\alpha f\| = |\alpha|\,\|f\|$ for any scalar (constant) $\alpha$;
• $\|f + g\| \le \|f\| + \|g\|$.


Some examples of norms of functions $[a, b] \to \mathbb{R}$ follow:


• $\|f\|_\infty = \max_{a \le x \le b} |f(x)|$ for continuous $f$;
• $\|f\|_1 = \int_a^b |f(x)|\,dx$;
• $\|f\|_p = \left[\int_a^b |f(x)|^p\,dx\right]^{1/p}$ for $1 \le p < \infty$;
• $\|f\|_{1,p} = \left[\int_a^b \left(|f(x)|^p + |f'(x)|^p\right) dx\right]^{1/p}$ for $1 \le p < \infty$.


Some examples of norms of functions $D \to \mathbb{R}$ where $D \subset \mathbb{R}^d$ is bounded and closed follow:


• $\|f\|_\infty = \max_{x \in D} |f(x)|$ for continuous $f$;
• $\|f\|_1 = \int_D |f(x)|\,dx$;
• $\|f\|_p = \left[\int_D |f(x)|^p\,dx\right]^{1/p}$ for $1 \le p < \infty$;
• $\|f\|_{1,p} = \left[\int_D \left(|f(x)|^p + \|\nabla f(x)\|^p\right) dx\right]^{1/p}$ for $1 \le p < \infty$.


There are also weighted norms, using a weight function $w(x)$ that is positive except on a set of zero volume, such as


• $\|f\|_{p,w} = \left[\int_D w(x)\,|f(x)|^p\,dx\right]^{1/p}$ for $1 \le p < \infty$.

Exercises.


(1) Show that for any vector $x$ we have $\|x\|_\infty \le \|x\|_2 \le \|x\|_1$.

(2) Show that for any vector $x \in \mathbb{R}^n$ we have $\|x\|_\infty \le \|x\|_2 \le n^{1/2}\,\|x\|_\infty$ so that the 2-norm and ∞-norm on $\mathbb{R}^n$ are equivalent norms.

(3) Show that any two norms on $\mathbb{R}^n$ are equivalent. [Hint: First show that a given norm $\|\cdot\|$ is equivalent to $\|\cdot\|_\infty$.]

(4) The norms on functions $\|f\|_1 = \int_a^b |f(x)|\,dx$ and $\|f\|_2 = \left[\int_a^b |f(x)|^2\,dx\right]^{1/2}$ over the interval $[a, b]$ are not equivalent. Show first that $\|f\|_1 \le \sqrt{b - a}\,\|f\|_2$. Now find a sequence of functions $f_k\colon [0, 1] \to \mathbb{R}$ where $\|f_k\|_2 = 1$ for all $k$, but $\|f_k\|_1 \to 0$ as $k \to \infty$. [Hint: Set $f_k(x) = \sqrt{k}$ on the interval $[0, 1/k]$ and zero elsewhere.]

(5) Show that the matrix 2-norm of an outer product $uv^T$ is given by $\|uv^T\|_2 = \|u\|_2\,\|v\|_2$.

(6) Use the Cauchy–Schwarz inequality (A.1.4) to show that $\|Ax\|_2 \le \|A\|_F\,\|x\|_2$, and so $\|A\|_2 \le \|A\|_F$ for any matrix $A$.

(7) The integral version of the previous exercise is to show that if $g(x) = \int_a^b k(x, y)\,f(y)\,dy$, then $\|g\|_2 \le \left[\int_a^b\!\int_a^b |k(x, y)|^2\,dx\,dy\right]^{1/2} \|f\|_2$. Show this using the integral version of the Cauchy–Schwarz inequality (A.1.5).

(8) Show that if we treat $x^T$ as a $1 \times n$ matrix, then $\|x^T\|_\infty = \|x\|_1$. From this, or directly, show that $|x^T y| \le \|x\|_1\,\|y\|_\infty$.

(9) Show that for any induced matrix norm (1.5.1), $\|I\| = 1$. From this show that the Frobenius norm $\|A\|_F$ is not an induced matrix norm.

(10) Show that the 2-norm of both matrices and vectors is orthogonally invariant. That is, if $Q$ and $W$ are orthogonal matrices ($Q^{-1} = Q^T$ and $W^{-1} = W^T$) then $\|Qx\|_2 = \|x\|_2$ and $\|QAW\|_2 = \|A\|_2$ for vectors $x$ and matrices $A$ of appropriate dimensions. Also show that $\|A^T\|_2 = \|A\|_2$. [Hint: For the last part, use $\|x\|_2 = \max_{y:\,\|y\|_2 = 1} y^T x$ and $y^T A x = x^T A^T y$, starting with the definition of $\|A\|_2$.]


1.6 Taylor Series and Taylor Polynomials 


James Gregory gave the first presentations of Taylor series representations for the 
standard trigonometric functions in 1667, although this had already been done in the 
1400s in India by Madhava of Sangamagrama. The general procedure was developed 
and described by Brook Taylor in 1715. The standard Taylor series is an infinite series 
whose convergence is required before it can be used for computations. By contrast, 
Taylor series with remainder give a finite Taylor polynomial and a remainder term 
that indicates the error in the polynomial approximating the original function. This 
remainder form is actually much more useful for numerical computation, as well as 
avoiding the convergence issues associated with the original infinite Taylor series.

1.6.1 Taylor Series in One Variable


Here we show how to obtain Taylor series with integral remainder.


Theorem 1.2 (Taylor series with remainder) If $f$ is $n + 1$ times continuously differentiable, then

$$f(x) = f(a) + f'(a)(x - a) + \frac{1}{2} f''(a)(x - a)^2 + \cdots + \frac{1}{n!} f^{(n)}(a)(x - a)^n + \frac{1}{n!}\int_a^x (x - t)^n f^{(n+1)}(t)\,dt. \tag{1.6.1}$$

The polynomial $f(a) + f'(a)(x - a) + \frac12 f''(a)(x - a)^2 + \cdots + (1/n!)\,f^{(n)}(a)(x - a)^n$ is called the Taylor polynomial of $f$ at $a$ of order $n$. The remaining term $(1/n!)\int_a^x (x - t)^n f^{(n+1)}(t)\,dt$ is called the remainder in integral form.


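Proof We prove (1.6.1) by induction on $n$, starting with $n = 0$.
For $n = 0$ we use the fundamental theorem of calculus:

$$f(x) = f(a) + \int_a^x f'(t)\,dt.$$

Suppose that (1.6.1) is true for $n = k$; we wish to show that it holds for $n = k + 1$. Using integration by parts,

$$\begin{aligned}
\int_a^x (x - t)^k f^{(k+1)}(t)\,dt &= -\frac{1}{k+1}\int_a^x \frac{d}{dt}\left[(x - t)^{k+1}\right] f^{(k+1)}(t)\,dt \\
&= -\frac{1}{k+1}\,(x - t)^{k+1} f^{(k+1)}(t)\Big|_{t=a}^{t=x} + \frac{1}{k+1}\int_a^x (x - t)^{k+1} f^{(k+2)}(t)\,dt \\
&= \frac{1}{k+1}\,(x - a)^{k+1} f^{(k+1)}(a) + \frac{1}{k+1}\int_a^x (x - t)^{k+1} f^{(k+2)}(t)\,dt.
\end{aligned}$$

Then

$$\begin{aligned}
f(x) &= \sum_{j=0}^{k} \frac{1}{j!} f^{(j)}(a)(x - a)^j + \frac{1}{k!}\int_a^x (x - t)^k f^{(k+1)}(t)\,dt \quad\text{(by the induction hypothesis)} \\
&= \sum_{j=0}^{k} \frac{1}{j!} f^{(j)}(a)(x - a)^j + \frac{1}{k!}\,\frac{1}{k+1}\,f^{(k+1)}(a)(x - a)^{k+1} + \frac{1}{k!}\,\frac{1}{k+1}\int_a^x (x - t)^{k+1} f^{(k+2)}(t)\,dt \\
&= \sum_{j=0}^{k+1} \frac{1}{j!} f^{(j)}(a)(x - a)^j + \frac{1}{(k+1)!}\int_a^x (x - t)^{k+1} f^{(k+2)}(t)\,dt,
\end{aligned}$$

which is what we wanted to prove.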


The Taylor series representation of $f$

$$f(x) = f(a) + f'(a)(x - a) + \frac{1}{2} f''(a)(x - a)^2 + \cdots + \frac{1}{n!} f^{(n)}(a)(x - a)^n + \cdots$$

holds provided both the infinite series converges, and the remainder term $R_n(x) := (1/n!)\int_a^x (x - t)^n f^{(n+1)}(t)\,dt$ goes to zero as $n \to \infty$.

Another form of the remainder term that is less precise, but often easier to use, is the point form. By the integral mean value theorem, since $(x - t)^n$ does not change sign for $a < t < x$,

$$\begin{aligned}
R_n(x) &= \frac{1}{n!}\int_a^x (x - t)^n f^{(n+1)}(t)\,dt \\
&= \frac{1}{n!}\,f^{(n+1)}(c_x)\int_a^x (x - t)^n\,dt \\
&= \frac{1}{n!}\,f^{(n+1)}(c_x)\,\frac{1}{n+1}\,(x - a)^{n+1} \\
&= \frac{1}{(n+1)!}\,f^{(n+1)}(c_x)\,(x - a)^{n+1}
\end{aligned}$$

for some $c_x$ between $a$ and $x$. It should be noted that $c_x$ depends not only on $x$, but also on $n$, $a$, and the function $f$.

The remainder term $R_n(x)$ can be bounded for many well-known functions: if $f(x) = e^x$, since $f^{(n)}(x) = (d^n/dx^n)e^x = e^x$, for $a = 0$ we have

$$R_n(x) = \frac{1}{(n+1)!}\,e^{c_x}\,(x - a)^{n+1} \quad\text{for some } c_x \text{ between } 0 \text{ and } x.$$

As the exponential function is an increasing function,

$$|R_n(x)| \le \frac{1}{(n+1)!}\,e^{\max(0, x)}\,|x|^{n+1}.$$

This bound can be used to determine, for example, the order of the Taylor polynomial needed to obtain a specified error level. For example, to guarantee an error level of no more than $10^{-12}$ for $|x| \le 1/2$, we would ensure first that

$$\frac{1}{(n+1)!}\,e^{\max(0, x)}\,|x|^{n+1} \le 10^{-12} \quad\text{for all } |x| \le \tfrac12.$$

Taking the worst case over $-\frac12 \le x \le +\frac12$ gives the condition

$$\frac{1}{(n+1)!}\,e^{1/2}\left(\tfrac12\right)^{n+1} \le 10^{-12}.$$

Table 1.6.1 Bounds on Taylor polynomial remainder term for $f(x) = e^x$, $a = 0$, and $|x| \le 1/2$

n    bound                n    bound
1    2.061 × 10^{-1}      7    1.597 × 10^{-7}
2    3.435 × 10^{-2}      8    8.874 × 10^{-9}
3    4.294 × 10^{-3}      9    4.437 × 10^{-10}
4    4.293 × 10^{-4}      10   2.017 × 10^{-11}
5    3.577 × 10^{-5}      11   8.403 × 10^{-13}
6    2.555 × 10^{-6}      12   3.232 × 10^{-14}


Finding the smallest $n$ satisfying this condition can be carried out by trial and error; values are shown in Table 1.6.1. Clearly $n = 11$ is sufficient to achieve a remainder less than $10^{-12}$ for $|x| \le \frac12$.
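The trial-and-error search is easy to automate. The sketch below evaluates the worst-case bound $e^{r} r^{n+1}/(n+1)!$ for $|x| \le r$ and returns the first order that meets a tolerance; the function names and the default $r = 1/2$ are assumptions of this illustration.

```python
import math

def remainder_bound(n, r=0.5):
    """Worst-case bound e^r r^(n+1) / (n+1)! on |R_n(x)| for f(x) = e^x, a = 0, |x| <= r."""
    return math.exp(r) * r ** (n + 1) / math.factorial(n + 1)

def smallest_order(tol, r=0.5):
    """Smallest n whose remainder bound is no more than tol."""
    n = 0
    while remainder_bound(n, r) > tol:
        n += 1
    return n

def taylor_exp(x, n):
    """Taylor polynomial of e^x at a = 0 of order n."""
    return sum(x ** k / math.factorial(k) for k in range(n + 1))

n = smallest_order(1e-12)
print("order needed:", n, " bound:", remainder_bound(n))
print("actual error at x = 1/2:", abs(taylor_exp(0.5, n) - math.exp(0.5)))
```

The actual error at $x = \frac12$ is smaller than the bound, as expected, because the bound replaces $e^{c_x}$ by its largest possible value.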


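1.6.2 Taylor Series and Polynomials in More than One Variable


Taylor series with remainder formulas can also be developed for functions of more than one variable. The trick here is to first reduce the problem to a one-variable problem. Suppose $f\colon \mathbb{R}^m \to \mathbb{R}$. Provided $f$ is continuously differentiable $n + 1$ times, the function $\phi(t) := f(a + td)$ is also $n + 1$ times continuously differentiable and so

$$\phi(t) = \phi(0) + \phi'(0)\,t + \frac{1}{2}\phi''(0)\,t^2 + \cdots + \frac{1}{n!}\phi^{(n)}(0)\,t^n + \frac{1}{n!}\int_0^t (t - u)^n \phi^{(n+1)}(u)\,du.$$

Then

$$\begin{aligned}
f(a + td) = f(a) &+ \frac{d}{ds} f(a + sd)\Big|_{s=0}\,t + \frac{1}{2}\,\frac{d^2}{ds^2} f(a + sd)\Big|_{s=0}\,t^2 + \cdots + \frac{1}{n!}\,\frac{d^n}{ds^n} f(a + sd)\Big|_{s=0}\,t^n \\
&+ \frac{1}{n!}\int_0^t (t - u)^n\,\frac{d^{n+1}}{ds^{n+1}} f(a + sd)\Big|_{s=u}\,du.
\end{aligned}$$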
Computing the derivatives $(d/ds)^k f(a + sd)$ can be done in terms of the partial derivatives of $f$, although the formulas are not especially nice:

$$\frac{d^k}{ds^k} f(a + sd) = \sum_{i_1, i_2, \ldots, i_k = 1}^{m} \frac{\partial^k f}{\partial x_{i_1}\,\partial x_{i_2} \cdots \partial x_{i_k}}(a + sd)\; d_{i_1} d_{i_2} \cdots d_{i_k}.$$

To help us with the task of dealing with such terms, we introduce some new notation:

$$D^k f(a)[v_1, v_2, \ldots, v_k] = \sum_{i_1, i_2, \ldots, i_k = 1}^{m} \frac{\partial^k f}{\partial x_{i_1}\,\partial x_{i_2} \cdots \partial x_{i_k}}(a)\;(v_1)_{i_1} (v_2)_{i_2} \cdots (v_k)_{i_k}. \tag{1.6.2}$$

The values

$$\frac{\partial^k f}{\partial x_{i_1}\,\partial x_{i_2} \cdots \partial x_{i_k}}(a) = \frac{\partial^{|\alpha|} f}{\partial x^{\alpha}}(a), \tag{1.6.3}$$

where $\alpha = (\alpha_1, \ldots, \alpha_d)$ is a multi-index: $\alpha_j$ is the number of times $j$ appears in the list $(i_1, i_2, \ldots, i_k)$. Since $\partial^2 g/\partial x_i \partial x_j = \partial^2 g/\partial x_j \partial x_i$ provided the second derivatives are continuous, where $j$ appears in the list $(i_1, i_2, \ldots, i_k)$ does not matter, only how many times it appears. The value $|\alpha| = \alpha_1 + \alpha_2 + \cdots + \alpha_d = k$ is the order of the derivative. Multi-indexes can also be used to label multi-variable monomials: $z^\alpha = z_1^{\alpha_1} z_2^{\alpha_2} \cdots z_d^{\alpha_d}$, which has degree $|\alpha|$. A general polynomial in $d$ variables of degree $k$ can be written as

$$p(x) = \sum_{\alpha:\,|\alpha| \le k} c_\alpha\,x^\alpha. \tag{1.6.4}$$

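With the notation of (1.6.2),

$$\frac{d^k}{ds^k} f(a + sd) = D^k f(a + sd)[\underbrace{d, d, \ldots, d}_{k \text{ times}}].$$

Then the Taylor series representation of $f(a + td)$ is

$$\begin{aligned}
f(a + td) = f(a) &+ D^1 f(a)[d]\,t + \frac{1}{2} D^2 f(a)[d, d]\,t^2 + \cdots + \frac{1}{n!} D^n f(a)[\underbrace{d, d, \ldots, d}_{n \text{ times}}]\,t^n \\
&+ \frac{1}{n!}\int_0^t (t - u)^n D^{n+1} f(a + ud)[\underbrace{d, d, \ldots, d}_{n+1 \text{ times}}]\,du.
\end{aligned} \tag{1.6.5}$$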


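The quantity $D^1 f(a)$ can be understood as the gradient vector of $f$ at $a$:

$$\nabla f(a) = \begin{bmatrix} \partial f/\partial x_1(a) \\ \partial f/\partial x_2(a) \\ \vdots \\ \partial f/\partial x_m(a) \end{bmatrix}$$

and $D^1 f(a)[d] = d^T \nabla f(a)$, while $D^2 f(a)$ can be understood as the Hessian matrix of $f$ at $a$:

$$\operatorname{Hess} f(a) = \begin{bmatrix}
\partial^2 f/\partial x_1 \partial x_1(a) & \partial^2 f/\partial x_1 \partial x_2(a) & \cdots & \partial^2 f/\partial x_1 \partial x_m(a) \\
\partial^2 f/\partial x_2 \partial x_1(a) & \partial^2 f/\partial x_2 \partial x_2(a) & \cdots & \partial^2 f/\partial x_2 \partial x_m(a) \\
\vdots & \vdots & \ddots & \vdots \\
\partial^2 f/\partial x_m \partial x_1(a) & \partial^2 f/\partial x_m \partial x_2(a) & \cdots & \partial^2 f/\partial x_m \partial x_m(a)
\end{bmatrix},$$

and $D^2 f(a)[d, d] = d^T \operatorname{Hess} f(a)\,d$. Note that since $\partial^2 f/\partial x_i \partial x_j = \partial^2 f/\partial x_j \partial x_i$ provided the second-order derivatives are continuous, $\operatorname{Hess} f(a)$ is a symmetric matrix.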

The quantities $D^k f(a)$ can be considered to be higher order tensors, which at an elementary level can be thought of as higher dimensional arrays of numbers. However, the fact that $\partial^2 f/\partial x_i \partial x_j = \partial^2 f/\partial x_j \partial x_i$ means that these tensors have certain symmetry relations. In particular, if $(e_1, e_2, \ldots, e_k)$ is a permutation of $(d_1, d_2, \ldots, d_k)$,

$$D^k f(a)[d_1, d_2, \ldots, d_k] = D^k f(a)[e_1, e_2, \ldots, e_k]. \tag{1.6.6}$$

Also note that $D^k f(a)[d_1, d_2, \ldots, d_k]$ is linear in each of the $d_j$'s:

$$D^k f(a)[d_1, \ldots, u + v, \ldots, d_k] = D^k f(a)[d_1, \ldots, u, \ldots, d_k] + D^k f(a)[d_1, \ldots, v, \ldots, d_k]. \tag{1.6.7}$$

This makes the function $(d_1, d_2, \ldots, d_k) \mapsto D^k f(a)[d_1, d_2, \ldots, d_k]$ multilinear for $k > 1$ and not linear, just as the function $(x, y) \mapsto xy$ is linear in $x$ and linear in $y$ but is not linear in $(x, y)$.

The remainder term in integral form

$$\frac{1}{n!}\int_0^t (t - u)^n D^{n+1} f(a + ud)[\underbrace{d, d, \ldots, d}_{n+1 \text{ times}}]\,du$$

can be represented in point form

$$\frac{1}{(n+1)!}\,D^{n+1} f(a + c_t d)[\underbrace{d, d, \ldots, d}_{n+1 \text{ times}}]\;t^{n+1}$$

for some $c_t$ between 0 and $t$, since we are integrating a scalar quantity.

We can bound $D^k f(a)[d, \ldots, d]$ in terms of the $k$th-order partial derivatives of $f$: if $|\partial^k f/\partial x_{i_1} \partial x_{i_2} \cdots \partial x_{i_k}(a)| \le M$ for all $(i_1, i_2, \ldots, i_k)$, then

$$|D^k f(a)[d, \ldots, d]| \le M\,\|d\|_1^k. \tag{1.6.8}$$

In general, we can define a norm (see Section 1.5.1) to measure the size of the derivatives of order $k$ given a vector norm $\|\cdot\|$:

$$\|D^k f(a)\| = \max\left\{\, |D^k f(a)[d_1, \ldots, d_k]| \;:\; \|d_j\| \le 1 \text{ for all } j \,\right\}. \tag{1.6.9}$$

From this we have the bound

$$|D^k f(a)[v_1, \ldots, v_k]| \le \|D^k f(a)\|\;\|v_1\|\,\|v_2\| \cdots \|v_k\|.$$

Note that with this definition, $\|\nabla f(a)\|_2 = \|D^1 f(a)\|_2$ and $\|\operatorname{Hess} f(a)\|_2 = \|D^2 f(a)\|_2$, using the ordinary vector 2-norm and matrix 2-norm, respectively.
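As a concrete check of the second-order expansion and the point-form remainder, the sketch below uses the test function $f(x) = e^{x_1}\sin x_2$, which is an assumption chosen only because its gradient, Hessian, and a bound on its third-order partial derivatives are easy to write down. It compares $f(a + td)$ with $f(a) + d^T \nabla f(a)\,t + \frac12 d^T \operatorname{Hess} f(a)\,d\;t^2$ and with the bound $M\,\|d\|_1^3\,t^3/3!$ obtained from (1.6.8).

```python
import math

def f(x):
    return math.exp(x[0]) * math.sin(x[1])

def grad_f(x):
    e, s, c = math.exp(x[0]), math.sin(x[1]), math.cos(x[1])
    return [e * s, e * c]

def hess_f(x):
    e, s, c = math.exp(x[0]), math.sin(x[1]), math.cos(x[1])
    return [[e * s, e * c], [e * c, -e * s]]

def taylor2(a, d, t):
    """f(a) + D^1 f(a)[d] t + (1/2) D^2 f(a)[d, d] t^2."""
    g, H = grad_f(a), hess_f(a)
    d1 = sum(di * gi for di, gi in zip(d, g))                             # d^T grad f(a)
    d2 = sum(d[i] * H[i][j] * d[j] for i in range(2) for j in range(2))   # d^T Hess f(a) d
    return f(a) + d1 * t + 0.5 * d2 * t * t

a, d, t = [0.3, 0.7], [1.0, -2.0], 1e-2
err = abs(f([a[0] + t * d[0], a[1] + t * d[1]]) - taylor2(a, d, t))

# Every third-order partial derivative of f is e^{x_1} times +-sin or +-cos,
# so M = e^{max x_1 on the segment} bounds them all.
M = math.exp(a[0] + abs(t * d[0]))
bound = M * sum(abs(di) for di in d) ** 3 * t ** 3 / 6.0
print("error:", err, " bound:", bound)
```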


1.6.3 Vector-Valued Functions 


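We can apply the above results to vector-valued functions. Suppose that $f\colon \mathbb{R}^n \to \mathbb{R}^m$. Then for each index $i = 1, 2, \ldots, m$ we have

$$\begin{aligned}
f_i(a + td) = f_i(a) &+ D^1 f_i(a)[d]\,t + \frac{1}{2} D^2 f_i(a)[d, d]\,t^2 + \cdots + \frac{1}{n!} D^n f_i(a)[\underbrace{d, d, \ldots, d}_{n \text{ times}}]\,t^n \\
&+ \frac{1}{n!}\int_0^t (t - u)^n D^{n+1} f_i(a + ud)[\underbrace{d, d, \ldots, d}_{n+1 \text{ times}}]\,du.
\end{aligned}$$

Stacking the components $D^k f_i(a)[d_1, \ldots, d_k]$ we can form

$$D^k f(a)[d_1, \ldots, d_k] = \begin{bmatrix} D^k f_1(a)[d_1, \ldots, d_k] \\ D^k f_2(a)[d_1, \ldots, d_k] \\ \vdots \\ D^k f_m(a)[d_1, \ldots, d_k] \end{bmatrix}. \tag{1.6.10}$$

With this notation,

$$\begin{aligned}
f(a + td) = f(a) &+ D^1 f(a)[d]\,t + \frac{1}{2} D^2 f(a)[d, d]\,t^2 + \cdots + \frac{1}{n!} D^n f(a)[\underbrace{d, d, \ldots, d}_{n \text{ times}}]\,t^n \\
&+ \frac{1}{n!}\int_0^t (t - u)^n D^{n+1} f(a + ud)[\underbrace{d, d, \ldots, d}_{n+1 \text{ times}}]\,du.
\end{aligned} \tag{1.6.11}$$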


The mean value theorem and the integral mean value theorem do not apply to vector-valued functions. If $f\colon \mathbb{R} \to \mathbb{R}^n$ with $n > 1$ and $f$ differentiable, we cannot conclude that $f(b) - f(a) = f'(c)(b - a)$ for some $c$ between $a$ and $b$. Take, for example, $f(\theta) = [\cos\theta, \sin\theta]^T$. Then $f(2\pi) - f(0) = 0$ but $f'(c) \ne 0$ for any $c$. Because of this, when we are dealing with vector-valued functions, we must use the integral form of the remainder, rather than the point form.
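The counterexample can be checked numerically. In this sketch (the sampling of 1001 points is an arbitrary choice), the difference $f(2\pi) - f(0)$ vanishes up to rounding, while $\|f'(c)\|_2 = 1$ at every sampled $c$, so no single $c$ can satisfy $f(b) - f(a) = f'(c)(b - a)$.

```python
import math

def f(theta):
    return (math.cos(theta), math.sin(theta))

def f_prime(theta):
    return (-math.sin(theta), math.cos(theta))

a, b = 0.0, 2.0 * math.pi
diff = tuple(fb - fa for fa, fb in zip(f(a), f(b)))
speeds = [math.hypot(*f_prime(a + (b - a) * k / 1000)) for k in range(1001)]

print("f(b) - f(a) =", diff)                     # zero up to rounding
print("min ||f'(c)||_2 over samples =", min(speeds))
```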

For vector-valued functions, we can define the norm

$$\|D^k f(a)\| = \max\left\{\, \|D^k f(a)[d_1, \ldots, d_k]\| \;:\; \|d_j\| \le 1 \text{ for all } j \,\right\} \tag{1.6.12}$$

so that

$$\|D^k f(a)[v_1, \ldots, v_k]\| \le \|D^k f(a)\|\;\|v_1\|\,\|v_2\| \cdots \|v_k\|. \tag{1.6.13}$$


Exercises. 


(1) The Taylor cubic of tan x around x = 0 is x + x3/3. Give a bound on the maxi- 
mum error for |x| < 7/8. 

(2) The exponential function e* can be approximated by the Taylor polynomial of 
degreen:e* © 14+ x+4+x7/2!4+---+.x"/n!. What value of n will guarantee that 
the error in this approximation is no more than 10~!° for any x with |x| < 1/2. 

(3) The natural logarithm function In(1 + x) has a Taylor series x — x7/2 + x3/3 — 
+++ (-—1)?x"/n +--+. What is the remainder for the degree n Taylor polyno- 
mial? What value of n will guarantee an error of no more than 10~’ for |x| < 1/2? 

(4) Suppose we have the task of writing a natural logarithm function. A strategy we 
use is to extract the exponent field so we can use In((1 + x) x 2°) = In(1+ x) 
+e In 2, and we store In 2 to high accuracy. If we just use the exponent field we 
get 5 < 1+ x <1. What degree of Taylor polynomial will guarantee an error 
in In(1 + x) of less than 10~!° for x in this range? 

(5) A modification of the method of the previous question is to use either the exponent 
field or the exponent field minus one: write z = (1 + x) x 2° with sa <l+x< 
a anda > | chosen to minimize the worst-case value of |x|. What is that value 
of a? With this value of a, what degree of Taylor polynomial would be needed 
to guarantee an error in In(1 + x) of less than 107! for x in this range? 

(6) Let $g(x) = \exp(-1/x^2)$ for $x \neq 0$ and $g(0) = 0$. Show that the $k$th-order
derivative $g^{(k)}(x) = p_k(1/x) \exp(-1/x^2)$ where $p_k$ is a polynomial. Show that
$g^{(k)}(0) = 0$ for all $k$. [Hint: Use l'Hospital's rule.] Find the fourth-order remainder
term. Give a bound on $|g^{(4)}(c)|$ for $|c| < 3$.

(7) For the function $f(x) = 1/\sqrt{1 + x^2}$, obtain the first four terms of the Taylor
series of $f(1/u)$ around $u = 0$ for $u > 0$.

(8) The $\Gamma$ function is a generalization of the factorial function ($\Gamma(n + 1) = n!$ for
$n = 0, 1, 2, \ldots$) given by the integral $\Gamma(n + 1) = \int_0^\infty t^n e^{-t}\, dt$. Write
$\Gamma(n + 1) = \int_0^\infty \exp(-t + n \ln t)\, dt$ and set $\phi_n(t) = -t + n \ln t$. Show that for $t^* = n$,
$\phi_n'(t^*) = 0$. Expand $\phi_n(t) = \phi_n(t^*) + \phi_n'(t^*)(t - t^*) + \tfrac12 \phi_n''(t^*)(t - t^*)^2 + O((t - t^*)^3)$.
Use the approximation
$$\begin{aligned}
\Gamma(n + 1) &= \int_0^\infty \exp\bigl(\phi_n(t^*) + \phi_n'(t^*)(t - t^*) + \tfrac12 \phi_n''(t^*)(t - t^*)^2\bigr)\, \bigl[1 + O((t - t^*)^3)\bigr]\, dt \\
&\approx \int_{-\infty}^{+\infty} \exp\bigl(\phi_n(t^*) + \phi_n'(t^*)(t - t^*) + \tfrac12 \phi_n''(t^*)(t - t^*)^2\bigr)\, dt
\end{aligned}$$
to give Stirling's approximation: $n! \sim \sqrt{2\pi n}\, (n/e)^n$ as $n \to \infty$. Note that the
integral can be expanded to being over $(-\infty, +\infty)$ as this adds an exponentially small
term to the value.

(9) In Exercise 8, show that
$$\exp(\phi_n(t)) = \exp\bigl(\phi_n(t^*) + \tfrac12 \phi_n''(t^*)(t - t^*)^2\bigr) \times \bigl[1 + \tfrac16 \phi_n'''(t^*)(t - t^*)^3 + \tfrac1{24} \phi_n^{(4)}(t^*)(t - t^*)^4 + O((t - t^*)^5)\bigr].$$
Use this to get an improved asymptotic estimate for $n!$. Note that
$\int_{-\infty}^{+\infty} \exp(a(t - t^*)^2)\, (t - t^*)^3\, dt = 0$ for any $a < 0$.

(10) Use Taylor series to estimate $\int_x^\infty \exp(-t^2/2)\, dt$: put $t = x + u$ so that
$\int_x^\infty \exp(-t^2/2)\, dt = \exp(-x^2/2) \int_0^\infty \exp(-xu) \exp(-u^2/2)\, du$. Use the Taylor
series expansion of $\exp(-u^2/2)$ around $u = 0$ and integrate term by term.
Note that $\int_0^\infty \exp(-xu)\, u^k\, du = k!\, x^{-k-1}$ for $x > 0$. Note that integrating with the
full Taylor series expansion does not give a sum that converges. Instead the
resulting series is only asymptotic.


Project 


Create a suite of functions exp, ln, sin, and cos using Taylor series. To make this work,
we need to reduce the range of the input. For example, we can use $\exp(x + y) = \exp(x) \exp(y)$
to restrict the Taylor series to evaluating $\exp(x)$ for $|x| \le 1/2$ by computing
$\exp(k) = e^k$ for integer $k$ by using repeated doubling. For natural logarithms,
we use $\ln((1 + x) \times 2^e) = \ln(1 + x) + e \ln 2$. For $\sin(x)$ and $\cos(x)$ we first reduce
$x$ to $|x| \le \pi$, and then use trigonometric addition rules to further reduce the range
to $|x| \le \pi/4$ or even $|x| \le \pi/8$.


Chapter 2
Computing with Matrices and Vectors


This chapter is about numerical linear algebra, that is, matrix computations. Numerical
computation with matrices (and vectors) is central to a great many algorithms,
so there has been a great deal of work on this topic. We can only scratch the surface 
here, but you should feel free to use this as the starting point for finding the methods 
and analysis most appropriate for your application(s). 

The three central problems in numerical linear algebra are as follows: 


• Solving a linear system of equations: solving $Ax = b$ for $x$.

• Minimizing the sum of squares of the errors: minimize $\sum_i (Ax - b)_i^2 = (Ax - b)^T (Ax - b)$.

• Finding eigenvalues and eigenvectors: find $x$ and $\lambda$ where $Ax = \lambda x$ and $x \neq 0$.


These are not the only problems in numerical linear algebra, but they are the ones 
that have the most uses. 


2.1 Solving Linear Systems 


The most common operation in numerical linear algebra is solving a square system 
of linear equations Ax = b. This is a fundamental computational task in scientific 
computation. If the linear systems are large, solving the linear systems can easily 
have the greatest computational burden of any part of a numerical computation. 

On the other hand, solving a linear system can be easy. If it were not for roundoff 
error, we could solve a linear system exactly, provided that the matrix is square 
and invertible. Many students learn row echelon form for solving linear systems 
as undergraduates and often have solved linear systems of two equations in two 
unknowns in high school. 

Here we must be systematic and turn techniques into pseudo-code that can be
implemented.


2.1.1 Gaussian Elimination


Consider the linear system of equations: 


$$\begin{aligned}
2x - y + z &= 6, \\
-2x + 2y - 3z &= -9, \\
4x - y - z &= 8.
\end{aligned}$$

To solve this system of equations, we perform operations on these equations, which
correspond to standard row operations on the augmented matrix containing the
coefficients and the right-hand side:

$$\left[\begin{array}{rrr|r} 2 & -1 & 1 & 6 \\ -2 & 2 & -3 & -9 \\ 4 & -1 & -1 & 8 \end{array}\right].$$


The standard row operations are as follows: 


• swapping rows;
• multiplying a row by a non-zero scalar; and
• adding a multiple of one row to another.


The last option is the most often used. So for the above linear system, we add the 
first row to the second, and subtract twice the first row from the third: 


$$\left[\begin{array}{rrr|r} 2 & -1 & 1 & 6 \\ 0 & 1 & -2 & -3 \\ 0 & 1 & -3 & -4 \end{array}\right].$$


This eliminates the first variable (x) from the last two equations. We then subtract 
the second row from the third: 


$$\left[\begin{array}{rrr|r} 2 & -1 & 1 & 6 \\ 0 & 1 & -2 & -3 \\ 0 & 0 & -1 & -1 \end{array}\right].$$

The last variable ($z$) can then be solved easily: $-z = -1$ so $z = 1$. This can be
substituted into the second equation $y - 2z = -3$ and solved for $y = -1$. Finally,
these results can be substituted into the first equation $2x - y + z = 6$ to give
$x = (6 + y - z)/2 = 2$.
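In general, we consider the coefficients of a linear system

$$\left[\begin{array}{ccccc|c}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} & b_1 \\
a_{21} & a_{22} & a_{23} & \cdots & a_{2n} & b_2 \\
a_{31} & a_{32} & a_{33} & \cdots & a_{3n} & b_3 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
a_{n1} & a_{n2} & a_{n3} & \cdots & a_{nn} & b_n
\end{array}\right].$$

We subtract multiples of the first row from the rows below in order to set the entries
$a_{21}, a_{31}, \ldots, a_{n1}$ to zero. That is, row $k$ is replaced by row $k$ minus $(a_{k1}/a_{11})$ times
the first row. This results in a new augmented matrix

$$\left[\begin{array}{ccccc|c}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} & b_1 \\
0 & a'_{22} & a'_{23} & \cdots & a'_{2n} & b'_2 \\
0 & a'_{32} & a'_{33} & \cdots & a'_{3n} & b'_3 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & a'_{n2} & a'_{n3} & \cdots & a'_{nn} & b'_n
\end{array}\right].$$

This has eliminated the first variable from all but the first equation. We can repeat the
process using the second row to zero out the entries below the (2, 2) entry. That is,
subtract $(a'_{k2}/a'_{22})$ times the second row from row $k$ for $k = 3, 4, \ldots, n$. This gives
a matrix

$$\left[\begin{array}{ccccc|c}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} & b_1 \\
0 & a'_{22} & a'_{23} & \cdots & a'_{2n} & b'_2 \\
0 & 0 & a''_{33} & \cdots & a''_{3n} & b''_3 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & a''_{n3} & \cdots & a''_{nn} & b''_n
\end{array}\right].$$

Continuing in this way we eventually come to

$$\left[\begin{array}{ccccc|c}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} & b_1 \\
0 & a'_{22} & a'_{23} & \cdots & a'_{2n} & b'_2 \\
0 & 0 & a''_{33} & \cdots & a''_{3n} & b''_3 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & a^{(n-1)}_{nn} & b^{(n-1)}_n
\end{array}\right].$$
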
The matrix of coefficients is now an upper triangular one, that is, all non-zeros
are on or above the main diagonal. This system of equations is now easily
solvable: $a^{(n-1)}_{nn} x_n = b^{(n-1)}_n$ gives $x_n = b^{(n-1)}_n / a^{(n-1)}_{nn}$. The second last equation
is $a^{(n-2)}_{n-1,n-1} x_{n-1} + a^{(n-2)}_{n-1,n} x_n = b^{(n-2)}_{n-1}$, which can be solved for $x_{n-1}$:
$x_{n-1} = \bigl(b^{(n-2)}_{n-1} - a^{(n-2)}_{n-1,n} x_n\bigr) / a^{(n-2)}_{n-1,n-1}$. This process can be repeated, working our way up
the $x$ vector until all of its entries are computed. This process is called backward
substitution.


Algorithm 7 Gaussian elimination; overwrites A and b

 1  function GE(A, b)
 2    n ← dim(b)
 3    for k = 1, 2, ..., n
 4      for i = k+1, ..., n
 5        m_ik ← a_ik / a_kk
 6        for j = k, k+1, ..., n
 7          a_ij ← a_ij − m_ik a_kj
 8        end for
 9        b_i ← b_i − m_ik b_k
10      end for
11    end for
12    return (A, b)
13  end function


Algorithm 8 Backward substitution; overwrites b

function backsubst(U, b)
  n ← dim(b)
  for k = n, n−1, ..., 2, 1
    s ← b_k
    for j = k+1, ..., n
      s ← s − u_kj b_j
    end for
    b_k ← s / u_kk
  end for
  return b // returns solution
end function


Pseudo-code for the elimination process, called Gaussian elimination, is in Algo- 
rithm 7. This code for Gaussian elimination overwrites A and the right-hand side 
vector b. 

The pseudo-code for backward substitution shown in Algorithm 8 overwrites the 
right-hand side b with the solution x for Ux = b where U is upper triangular. 
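
Algorithms 7 and 8 translate almost line for line into a programming language. The following is a minimal Python sketch, assuming the matrix is stored as a list of rows of floats and that no zero pivot is met; the function names `gaussian_elimination` and `back_substitution` are our own and the only change from the pseudo-code is the switch to 0-based indices. It is applied to the 3 × 3 example at the start of this section.

```python
def gaussian_elimination(A, b):
    """Algorithm 7 with 0-based indices: overwrites A (upper triangular) and b."""
    n = len(b)
    for k in range(n):
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]       # multiplier m_ik; assumes the pivot a_kk is non-zero
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    return A, b


def back_substitution(U, b):
    """Algorithm 8 with 0-based indices: overwrites b with the solution of U x = b."""
    n = len(b)
    for k in range(n - 1, -1, -1):
        s = b[k]
        for j in range(k + 1, n):
            s -= U[k][j] * b[j]
        b[k] = s / U[k][k]
    return b


A = [[2.0, -1.0, 1.0], [-2.0, 2.0, -3.0], [4.0, -1.0, -1.0]]
b = [6.0, -9.0, 8.0]
U, c = gaussian_elimination(A, b)
x = back_substitution(U, c)
print(x)   # the hand computation above gave x = 2, y = -1, z = 1
```

All intermediate quantities in this small example are exactly representable, so the computed solution agrees with the hand computation.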

The computational cost of these methods can be determined in terms of the num- 
bers of floating point operations (+, —, x, /). These numbers are only a proxy for 
the total time taken, even when you factor in the rating for the number of floating 
point operations per second for a given computer. Since additions, subtractions and 
multiplication typically can be done at a rate of one per clock cycle, data movement 
costs are often much more important, and processor dependent. Division operations 
typically require 10-15 clock cycles each. 

Nevertheless, counting floating point operations gives us a good sense as to the
asymptotic time taken by the algorithms at least for large $n$. For Gaussian elimination,
lines 4-10 require $(n - k)(2(n - k + 1) + 3)$ flops for each iteration over $k$. The total
number of floating point operations for Gaussian elimination is therefore

$$\sum_{k=1}^{n-1} (n - k)\bigl(2(n - k + 1) + 3\bigr) \sim \tfrac{2}{3} n^3 \text{ flops}.$$

By comparison, the number of flops required by backward substitution is

$$\sum_{k=1}^{n} \bigl(2(n - k) + 1\bigr) = n^2 \text{ flops}.$$


Clearly, the more expensive part for large n is Gaussian elimination. 
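
The two operation counts can be checked by walking through the loops of Algorithms 7 and 8 and counting instead of computing. The sketch below does this in Python; the helper names `ge_flops` and `backsubst_flops` are hypothetical, and each `+`, `−`, `×`, `/` is counted as one flop as in the text.

```python
def ge_flops(n):
    """Count the flops performed by Algorithm 7 for an n x n system."""
    flops = 0
    for k in range(1, n + 1):
        for i in range(k + 1, n + 1):
            flops += 1                   # division for m_ik
            flops += 2 * (n - k + 1)     # j = k..n: one multiply and one subtract each
            flops += 2                   # update of b_i
    return flops


def backsubst_flops(n):
    """Count the flops performed by Algorithm 8 for an n x n system."""
    flops = 0
    for k in range(n, 0, -1):
        flops += 2 * (n - k)             # inner loop over j = k+1..n
        flops += 1                       # division by u_kk
    return flops


for n in (10, 100, 1000):
    ratio = ge_flops(n) / ((2 / 3) * n**3)
    print(n, ge_flops(n), round(ratio, 4), backsubst_flops(n))
```

The ratio of the exact count to $\tfrac23 n^3$ approaches one as $n$ grows, while the backward substitution count equals $n^2$ exactly.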


2.1.2 LU Factorization 


The process of Gaussian elimination for the entries of the matrix A does not depend 
on b, although the changes to b do depend on the entries in A. We can separate the 
two parts of this process. This leads us to the LU factorization, the most common 
method for solving small-to-moderate systems of linear equations. The main difference
between Gaussian elimination and LU factorization is just saving the multipliers
$m_{ik}$. We can put the multipliers into a matrix

$$L = \begin{bmatrix}
1 \\
m_{21} & 1 \\
m_{31} & m_{32} & 1 \\
\vdots & \vdots & \vdots & \ddots \\
m_{n1} & m_{n2} & m_{n3} & \cdots & 1
\end{bmatrix}.$$


The remarkable property of this matrix is that LU = A where U is the upper triangu- 
lar matrix remaining after A is overwritten in Gaussian elimination, that is, storing 
the multipliers enables us to reconstruct the matrix A. While it can be somewhat 
difficult to see this directly, there is a recursive version of the LU factorization that 
makes this easier to see. 


Write
$$L = \begin{bmatrix} 1 & \\ m & \widetilde{L} \end{bmatrix}, \qquad A = \begin{bmatrix} \alpha & r^T \\ c & \widetilde{A} \end{bmatrix} \quad (n \times n).$$

Note that $\widetilde{L}$ is also lower triangular (that is, all non-zeros occur on or below the main
diagonal). Note that the first row of $A$ is not changed by Gaussian elimination. The
remaining upper triangular matrix after elimination is

$$U = \begin{bmatrix} \alpha & r^T \\ & \widetilde{U} \end{bmatrix},$$

where $\widetilde{U}$ is also upper triangular. Note that the multipliers in Gaussian elimination are
obtained by setting $m$ to $c/\alpha$ (since $\alpha = a_{11}$ and $m_{i1} \leftarrow a_{i1}/a_{11}$). The effect of the
first stage of Gaussian elimination (for $k = 1$) is to replace $a_{ij} \leftarrow a_{ij} - m_{i1} a_{1j}$ for
$i, j > 1$. This corresponds to replacing $\widetilde{A}$ with $A' = \widetilde{A} - m r^T$. Under our implicit
induction hypothesis, Gaussian elimination applied to $A'$ is equivalent to factoring $A' = \widetilde{L}\widetilde{U}$.


Algorithm 9 Forward substitution; overwrites b

function forwardsubst(L, b)
  n ← dim(b)
  for k = 1, 2, ..., n
    s ← b_k
    for j = 1, 2, ..., k−1
      s ← s − l_kj b_j
    end for
    b_k ← s / l_kk
  end for
  return b // returns solution
end function


Algorithm 10 LU factorization; overwrites A with U

 1  function LU(A)
 2    n ← num_rows(A); L ← I
 3    for k = 1, 2, ..., n
 4      for i = k+1, ..., n
 5        m_ik ← a_ik / a_kk
 6        for j = k, k+1, ..., n
 7          a_ij ← a_ij − m_ik a_kj
 8        end for
 9        l_ik ← m_ik
10      end for
11    end for
12    return (L, A)
13  end function


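Then

$$LU = \begin{bmatrix} 1 & \\ m & \widetilde{L} \end{bmatrix} \begin{bmatrix} \alpha & r^T \\ & \widetilde{U} \end{bmatrix}
= \begin{bmatrix} \alpha & r^T \\ m\alpha & \widetilde{L}\widetilde{U} + m r^T \end{bmatrix}
= \begin{bmatrix} \alpha & r^T \\ c & A' + m r^T \end{bmatrix}
= \begin{bmatrix} \alpha & r^T \\ c & \widetilde{A} \end{bmatrix} = A,$$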


and we have reconstructed A. Although we have not gone through a formal proof by 
induction, we see that we do indeed have A = LU, which is the LU factorization. 

The strategy of LU factorization then is to ignore b in the elimination process, 
but save the multipliers in L. To compensate for ignoring b in the elimination, when 
we need to solve Ax = b we carry out two triangular solves: L(Ux) = b. Setting 
z = Ux we first solve Lz = b for z. Then we solve Ux = z for x:


solve Lz=b for z 
solve Ux =z for x. 


To solve Lz = b for z we use a variant of backward substitution called forward 
substitution, as shown in Algorithm 9. 

The cost of forward substitution in terms of floating point operations is identical
to that of backward substitution. For the version of LU factorization developed here,
the diagonal entries $\ell_{kk} = 1$, so that a division could be removed for each $k$. This
gives a small reduction in the cost of forward substitution.

Here we give a non-recursive pseudo-code for LU factorization in Algorithm 10.
Often $A$ is overwritten with both $L$ and $U$. Since the diagonal entries of $L$ are all
ones, they do not need to be stored. Instead we can store the multipliers in the strictly
lower triangular part of $A$, that is, store $m_{ik} = \ell_{ik}$ for $i > k$ in $a_{ik}$. At the end of the
LU factorization, the overwritten $A$ contains

$$\begin{bmatrix}
u_{11} & u_{12} & u_{13} & \cdots & u_{1n} \\
\ell_{21} & u_{22} & u_{23} & \cdots & u_{2n} \\
\ell_{31} & \ell_{32} & u_{33} & \cdots & u_{3n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\ell_{n1} & \ell_{n2} & \ell_{n3} & \cdots & u_{nn}
\end{bmatrix}.$$

2.1.3 Errors in Solving Linear Systems 


If we could carry out LU factorization with forward and backward substitution with 
exact arithmetic we could solve linear systems exactly. However, we still have the 
effect of roundoff error. This can affect the construction of A and b as well as the 
solution process. Further, it is worth understanding how errors in A and b produce 
errors in the computed solution x. These errors may come from other sources, such 
as measurement error and approximations in other numerical procedures. 

To understand errors in solving linear systems, we separate the issue of errors in 
the original data A and b from the issue of errors generated by the solution process. To 
deal primarily with errors in the original data, we look to the perturbation theorem 
for linear systems (Theorem 2.1). To deal with errors generated by the solution 
process, we look to the backward error analysis originally due to James Wilkinson 
(Theorem 2.5). 


2.1.3.1 Perturbation Theorem for Linear Systems 


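Theorem 2.1 (Perturbation theorem for linear systems) Suppose $A$ is invertible and

$$Ax = b, \qquad (A + E)\widehat{x} = b + d.$$

Then

$$\frac{\|\widehat{x} - x\|}{\|x\|} \le \frac{\kappa(A)}{1 - \kappa(A)\,(\|E\| / \|A\|)} \left[ \frac{\|E\|}{\|A\|} + \frac{\|d\|}{\|b\|} \right] \tag{2.1.1}$$

where $\kappa(A) = \|A\|\, \|A^{-1}\|$, provided the denominator is positive.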




We can consider the quantities $\|d\| / \|b\|$ and $\|E\| / \|A\|$ to be the relative errors
in the right-hand side and the matrix. The quantity $\kappa(A) = \|A\|\, \|A^{-1}\|$ is called the
condition number of $A$ and represents in a general sense how much relative errors in
the data are amplified by solving the system of equations. This theorem says nothing
about the solution process or roundoff error.

Before we prove the perturbation theorem, we need a basic result on inverses of
perturbations of the identity matrix.


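Lemma 2.2 If $\|F\| < 1$ then $(I + F)^{-1} = I - F + F^2 - F^3 + \cdots$,
$\|(I + F)^{-1}\| \le 1/(1 - \|F\|)$, and $\|(I + F)^{-1} - I\| \le \|F\| / (1 - \|F\|)$.


Proof (Of Lemma 2.2). Suppose that $\|F\| < 1$. By induction using properties of
matrix norms, we can see that $\|F^k\| = \|F F^{k-1}\| \le \|F\|\, \|F^{k-1}\| \le \cdots \le \|F\|^k$.
The infinite matrix series $I - F + F^2 - F^3 + \cdots$ converges as its partial sums are
a Cauchy sequence: if $M > N$ then

$$\begin{aligned}
\Bigl\| \sum_{k=0}^{M} (-F)^k - \sum_{k=0}^{N} (-F)^k \Bigr\|
&= \Bigl\| \sum_{k=N+1}^{M} (-F)^k \Bigr\| \le \sum_{k=N+1}^{M} \|F\|^k \\
&= \|F\|^{N+1}\, \frac{1 - \|F\|^{M-N}}{1 - \|F\|} \le \frac{\|F\|^{N+1}}{1 - \|F\|},
\end{aligned}$$

which goes to zero as $N \to \infty$. To show that

$$(I + F)^{-1} = \sum_{k=0}^{\infty} (-F)^k,$$

we note that

$$(I + F) \sum_{k=0}^{N} (-F)^k = I - (-F)^{N+1}.$$

Thus, taking limits as $N \to \infty$ gives

$$(I + F) \sum_{k=0}^{\infty} (-F)^k = I$$

and thus $(I + F)^{-1} = \sum_{k=0}^{\infty} (-F)^k$.
To obtain the bounds we note that

$$\|(I + F)^{-1}\| = \Bigl\| \sum_{k=0}^{\infty} (-F)^k \Bigr\| \le \sum_{k=0}^{\infty} \|F\|^k = \frac{1}{1 - \|F\|}, \quad\text{and}$$

$$\|(I + F)^{-1} - I\| = \|F (I + F)^{-1}\| \le \frac{\|F\|}{1 - \|F\|},$$

as we wanted.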


With this Lemma ready, we can start the main course. 


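Proof (Of Theorem 2.1). Premultiply $(A + E)\widehat{x} = b + d$ by $A^{-1}$:

$$\begin{aligned}
(I + A^{-1}E)\,\widehat{x} &= A^{-1}(b + d) = A^{-1}b + A^{-1}d \\
&= x + A^{-1}d \\
&= (I + A^{-1}E)\,x - A^{-1}E x + A^{-1}d \\
&= (I + A^{-1}E)\,x + A^{-1}(d - Ex).
\end{aligned} \tag{2.1.2}$$

The matrix $I + A^{-1}E$ is invertible:

$$\|A^{-1}E\| \le \|A^{-1}\|\, \|E\| = \kappa(A)\, \frac{\|E\|}{\|A\|} < 1,$$

so $I + A^{-1}E$ is invertible by Lemma 2.2. Pre-multiplying (2.1.2) by $(I + A^{-1}E)^{-1}$
and subtracting $x$, we get

$$\widehat{x} - x = (I + A^{-1}E)^{-1} A^{-1} (d - Ex).$$

Taking norms:

$$\begin{aligned}
\|\widehat{x} - x\| &= \|(I + A^{-1}E)^{-1} A^{-1} (d - Ex)\| \\
&\le \|(I + A^{-1}E)^{-1}\|\, \|A^{-1}\|\, \|d - Ex\| \\
&\le \frac{1}{1 - \|A^{-1}E\|}\, \|A^{-1}\|\, \bigl(\|d\| + \|E\|\, \|x\|\bigr) \\
&= \frac{\|A^{-1}\|}{1 - \|A^{-1}E\|} \left[ \frac{\|d\|}{\|b\|}\, \|b\| + \|E\|\, \|x\| \right] \\
&= \frac{\|A^{-1}\|}{1 - \|A^{-1}E\|} \left[ \frac{\|d\|}{\|b\|}\, \|Ax\| + \|E\|\, \|x\| \right] \\
&\le \frac{\|A^{-1}\|}{1 - \|A^{-1}E\|} \left[ \frac{\|d\|}{\|b\|}\, \|A\|\, \|x\| + \|E\|\, \|x\| \right] \\
&= \frac{\|A^{-1}\|\, \|A\|\, \|x\|}{1 - \|A^{-1}E\|} \left[ \frac{\|d\|}{\|b\|} + \frac{\|E\|}{\|A\|} \right].
\end{aligned}$$

Then

$$\frac{\|\widehat{x} - x\|}{\|x\|} \le \frac{\|A\|\, \|A^{-1}\|}{1 - \|A^{-1}E\|} \left[ \frac{\|d\|}{\|b\|} + \frac{\|E\|}{\|A\|} \right].$$

Note that $\|A\|\, \|A^{-1}\| = \kappa(A)$, the condition number of $A$ in the given matrix norm.
Also,

$$\|A^{-1}E\| \le \|A^{-1}\|\, \|E\| = \|A^{-1}\|\, \|A\|\, (\|E\| / \|A\|) = \kappa(A)\, (\|E\| / \|A\|).$$

So provided $\kappa(A)\, (\|E\| / \|A\|) < 1$, we have

$$1 - \|A^{-1}E\| \ge 1 - \kappa(A)\, (\|E\| / \|A\|) > 0, \quad\text{and}\quad
\frac{1}{1 - \|A^{-1}E\|} \le \frac{1}{1 - \kappa(A)\, (\|E\| / \|A\|)}.$$

That is,

$$\frac{\|\widehat{x} - x\|}{\|x\|} \le \frac{\|A\|\, \|A^{-1}\|}{1 - \|A^{-1}E\|} \left[ \frac{\|d\|}{\|b\|} + \frac{\|E\|}{\|A\|} \right]
\le \frac{\kappa(A)}{1 - \kappa(A)\, (\|E\| / \|A\|)} \left[ \frac{\|d\|}{\|b\|} + \frac{\|E\|}{\|A\|} \right],$$

as we wanted.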


How can we use Theorem 2.1? 

First we need to note that $\kappa(A) = \|A\|\, \|A^{-1}\| \ge \|A A^{-1}\| = \|I\|$. Note that
$\|I\| = 1$ in any induced norm; even in a matrix norm that is not induced,
$\|I\| = \|I\, I\| \le \|I\|\, \|I\| = \|I\|^2$ so $1 \le \|I\|$. In either case, $\kappa(A) \ge 1$.

If $\kappa(A)\, (\|E\| / \|A\|) \ll 1$ then the denominator $1 - \kappa(A)\, (\|E\| / \|A\|) \approx 1$ and
the relative errors in the data $\|d\| / \|b\|$ and $\|E\| / \|A\|$ are amplified by a factor of
close to $\kappa(A)$. Since $\kappa(A) \ge 1$, the bound on the relative error in the solution
($\|\widehat{x} - x\| / \|x\|$) can never be reduced by a small $\kappa(A)$. Instead, we aim to prevent $\kappa(A)$ from becoming
"too large". How large is "too large" depends on the application, but we definitely
want to keep $\kappa(A)$ from becoming larger than 1/(unit roundoff), as then any roundoff
error in the data or solution process can completely destroy any accuracy in the
solution.
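
To see the bound (2.1.1) at work, the following Python sketch evaluates both sides for a small example in the ∞-norm. Everything here is illustrative: the 2 × 2 matrix, the perturbations `E` and `d`, and the helper names are our own choices, and exact rational arithmetic (`fractions.Fraction`) is used so that roundoff plays no part in the comparison.

```python
from fractions import Fraction as F


def norm_inf_mat(M):
    """Induced infinity norm: maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)


def norm_inf_vec(v):
    return max(abs(t) for t in v)


def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(M, v):
    return [sum(m * t for m, t in zip(row, v)) for row in M]


A = [[F(1), F(2)], [F(3), F(4)]]
E = [[F(1, 1000), F(0)], [F(0), F(-1, 1000)]]     # perturbation of the matrix
b = [F(1), F(1)]
d = [F(1, 1000), F(-1, 500)]                      # perturbation of the right-hand side

x = matvec(inverse_2x2(A), b)
A_pert = [[A[i][j] + E[i][j] for j in range(2)] for i in range(2)]
xhat = matvec(inverse_2x2(A_pert), [b[i] + d[i] for i in range(2)])

kappa = norm_inf_mat(A) * norm_inf_mat(inverse_2x2(A))
rel_E = norm_inf_mat(E) / norm_inf_mat(A)
rel_d = norm_inf_vec(d) / norm_inf_vec(b)
lhs = norm_inf_vec([xh - xi for xh, xi in zip(xhat, x)]) / norm_inf_vec(x)
rhs = kappa / (1 - kappa * rel_E) * (rel_E + rel_d)
print(float(kappa), float(lhs), float(rhs))
```

The observed relative error is well below the bound here, which is typical: (2.1.1) is a worst-case estimate over all perturbations of the given size.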


2.1.3.2 Wilkinson’s Backward Error Analysis 


In the years 1947-1948, there were two articles about the error in numerical solution
of linear systems of equations by Gaussian elimination or LU factorization with
floating point arithmetic: one by John von Neumann and H. Goldstine [109, 256]
and the other by Alan Turing [249]. The conclusions of these articles tended to be
limited in scope, or even pessimistic, about the accuracy of the computed solutions. A
few years later, the English mathematician James H. Wilkinson [261] discovered an
example of LU factorization in his work which had lost 5 digits of accuracy, but the
reconstruction error $A - \widehat{L}\widehat{U}$ showed more than 10 digits of accuracy, where $\widehat{L}$ and
$\widehat{U}$ are the computed $L$ and $U$ matrices. A more modern example of this phenomenon
can be seen in the following example:


n ← 20
A ← [1/(i + j − 1) | i, j = 1, 2, ..., n] // n × n Hilbert matrix
x ← random n-dimensional vector
b ← Ax
solve A x̂ = b for x̂
print ‖x − x̂‖ // can be big...
print ‖A x̂ − b‖ // always small!
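
A runnable version of this experiment is sketched below in plain Python. It departs from the pseudo-code in three labeled ways: the solver is our own Gaussian elimination with partial pivoting (the pseudo-code does not say which solver is used), $n = 10$ rather than 20, and $x$ is the vector of all ones rather than a random vector so that the run is reproducible. Both norms are ∞-norms.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting, then back substitution (on copies)."""
    n = len(b)
    A = [row[:] for row in A]
    b = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))   # pivot row for column k
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    for k in range(n - 1, -1, -1):
        s = b[k] - sum(A[k][j] * b[j] for j in range(k + 1, n))
        b[k] = s / A[k][k]
    return b


n = 10
A = [[1.0 / (i + j + 1) for j in range(n)] for i in range(n)]   # Hilbert matrix, 0-based i, j
x = [1.0] * n
b = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
xhat = solve(A, b)

err = max(abs(x[i] - xhat[i]) for i in range(n))
res = max(abs(sum(A[i][j] * xhat[j] for j in range(n)) - b[i]) for i in range(n))
print("error    :", err)    # many orders of magnitude above the unit roundoff
print("residual :", res)    # close to the unit roundoff
```

The gap between the two printed numbers is the point of the example: the computed solution nearly satisfies the equations even though it is far from the true solution.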


As above, we let $\widehat{L}$, $\widehat{U}$ denote the computed $L$ and $U$ factors in the LU factorization
of $A$.

Wilkinson's method of analysis shows that the computed $\widehat{L}$ and $\widehat{U}$ factors as well
as the computed solution $\widehat{x}$ are, in fact, the exact solutions for nearby systems:
$(A + E)\widehat{x} = b$ and $A + E' = \widehat{L}\widehat{U}$, with $E$ and $E'$ "small".

The basis of these results is the following model of floating point arithmetic: 


$$fl(x \circ y) = (x \circ y)(1 + \epsilon), \qquad |\epsilon| \le u$$

for the arithmetic operations $\circ = +, -, \times, /$. Here $fl(\text{expression})$ denotes the result
of the expression with floating point operations. Also $u > 0$ is the unit roundoff of
the floating point system. For IEEE double precision, $u \approx 2.2 \times 10^{-16}$.
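
The model can be tested directly, since the exact result of an operation on two floating point numbers is a rational number. The Python sketch below is a check under stated assumptions: Python floats are IEEE double precision with round-to-nearest, for which the tighter value $2^{-53} \approx 1.1 \times 10^{-16}$ also satisfies the model, and the operands are drawn from $[1, 2)$ so that no overflow or underflow occurs. The helper name `rel_error` is our own.

```python
import random
from fractions import Fraction

u = Fraction(1, 2**53)   # assumed unit roundoff for round-to-nearest double precision


def rel_error(computed, exact):
    """Exact relative error |computed - exact| / |exact| as a rational number."""
    return abs(Fraction(computed) - exact) / abs(exact)


random.seed(0)
worst = Fraction(0)
for _ in range(2000):
    x = random.uniform(1.0, 2.0)
    y = random.uniform(1.0, 2.0)
    X, Y = Fraction(x), Fraction(y)          # exact rational values of the two floats
    pairs = ((x + y, X + Y), (x - y, X - Y), (x * y, X * Y), (x / y, X / Y))
    for computed, exact in pairs:
        if exact != 0:
            worst = max(worst, rel_error(computed, exact))

print(float(worst), float(u))
```

The largest observed $|\epsilon|$ stays below $u$, as the model requires.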

The approach taken here is based on that of Higham (see, for example, [125]), 
which in turn is based on an analysis in the textbook of Stoer and Bulirsch [240] for 
LU factorization. To start, let 


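$$\gamma_n = \frac{nu}{1 - nu} \quad\text{if } nu < 1. \tag{2.1.3}$$

The important properties of $\gamma_n$ are that $1 + u \le (1 - u)^{-1} \le 1 + \gamma_1$ and
$(1 + \gamma_k)(1 + \gamma_\ell) \le 1 + \gamma_{k+\ell}$ for $k, \ell = 1, 2, 3, \ldots$. Note that an alternative formula for
$\gamma_n$ that has these properties is $\gamma_n = \exp(nu(1 + u)) - 1$ provided $u \le 1/2$.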


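Lemma 2.3 If $|\delta_i| \le u$ and $\rho_i = \pm 1$ for all $i$, with $nu < 1$, then

$$\prod_{i=1}^{n} (1 + \delta_i)^{\rho_i} = 1 + \theta_n, \qquad |\theta_n| \le \gamma_n. \tag{2.1.4}$$


Proof To start mathematical induction, note that condition (2.1.4) is clearly true for
$n = 1$. Assume that (2.1.4) is true for $n = k$. We show that it is true for $n = k + 1$.
Note that $\gamma_k + \gamma_1 + \gamma_k \gamma_1 \le \gamma_{k+1}$. Consider

$$\prod_{i=1}^{k+1} (1 + \delta_i)^{\rho_i} = \Bigl[ \prod_{i=1}^{k} (1 + \delta_i)^{\rho_i} \Bigr] (1 + \delta_{k+1})^{\rho_{k+1}}
= (1 + \theta_k)(1 + \delta_{k+1})^{\rho_{k+1}}.$$

If $\rho_{k+1} = +1$, $\theta_{k+1} = \theta_k + \delta_{k+1} + \theta_k \delta_{k+1}$ and so

$$|\theta_{k+1}| \le |\theta_k| + |\delta_{k+1}| + |\delta_{k+1}|\, |\theta_k| \le \gamma_k + \gamma_1 + \gamma_k \gamma_1 \le \gamma_{k+1}.$$

If $\rho_{k+1} = -1$, $\theta_{k+1} = \theta_k - \delta_{k+1} (1 + \theta_k) / (1 + \delta_{k+1})$ and then

$$|\theta_{k+1}| \le |\theta_k| + \frac{|\delta_{k+1}|\, (1 + |\theta_k|)}{1 - |\delta_{k+1}|} \le \gamma_k + \gamma_1 (1 + \gamma_k) \le \gamma_{k+1}.$$

Thus, by the principle of mathematical induction, $|\theta_n| \le \gamma_n$ for all $n$ with $nu < 1$.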


We can use this lemma to produce another lemma which treats common calculations 
in numerical linear algebra. 


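Lemma 2.4 If in an algorithm, $s \leftarrow \bigl(c - \sum_{i=1}^{k-1} a_i b_i\bigr) / b_k$ is evaluated in the
common order by correctly rounded floating point arithmetic, then the computed value
of $s$ (denoted $\widehat{s}$) satisfies the following inequality:

$$\Bigl| c - \sum_{i=1}^{k-1} a_i b_i - \widehat{s}\, b_k \Bigr| \le \gamma_k \Bigl( |\widehat{s}\, b_k| + \sum_{i=1}^{k-1} |a_i b_i| \Bigr).$$


Proof Consider the algorithm

s_0 ← c
for i = 1, 2, ..., k−1
  s_i ← s_{i−1} − a_i b_i
end for
s_k ← s_{k−1} / b_k.

The computed value of $s_i$ ($i = 1, 2, \ldots, k - 1$) is

$$\begin{aligned}
\widehat{s}_i &= fl(\widehat{s}_{i-1} - a_i b_i) \\
&= (\widehat{s}_{i-1} - fl(a_i b_i))(1 + \delta_i) \\
&= (\widehat{s}_{i-1} - a_i b_i (1 + \epsilon_i))(1 + \delta_i)
\end{aligned}$$

with $|\delta_i|, |\epsilon_i| \le u$. Thus

$$\widehat{s}_{k-1} = c \prod_{i=1}^{k-1} (1 + \delta_i) - \sum_{i=1}^{k-1} a_i b_i (1 + \epsilon_i) \prod_{j=i}^{k-1} (1 + \delta_j).$$

Finally,

$$\widehat{s}_k = fl(\widehat{s}_{k-1} / b_k) = (\widehat{s}_{k-1} / b_k)(1 + \delta_k)$$

with $|\delta_k| \le u$. Because of this,

$$\widehat{s}_k b_k = \widehat{s}_{k-1} (1 + \delta_k)
= c \prod_{j=1}^{k} (1 + \delta_j) - \sum_{i=1}^{k-1} a_i b_i (1 + \epsilon_i) \prod_{j=i}^{k} (1 + \delta_j).$$

Multiplying by $1 + \theta_k = \prod_{j=1}^{k} (1 + \delta_j)^{-1}$ produces

$$\widehat{s}_k b_k (1 + \theta_k) = c - \sum_{i=1}^{k-1} a_i b_i (1 + \epsilon_i) \prod_{j=1}^{i-1} (1 + \delta_j)^{-1}
= c - \sum_{i=1}^{k-1} a_i b_i (1 + \theta_i)$$

with $|\theta_i| \le \gamma_i \le \gamma_k$ for all $i \le k$ by Lemma 2.3. Re-arranging then gives the desired conclusion.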


Now we are ready for the proof of Wilkinson’s theorem. One of the important impli- 
cations of this result is that we should try to avoid having large entries in L and U 
matrices. 


Theorem 2.5 (Wilkinson's backward error result.) Suppose that $A$ is an $n \times n$
matrix and $b$ an $n$-dimensional vector of floating point numbers. Further suppose
that $\widehat{L}$ and $\widehat{U}$ are the computed matrices for the LU factorization of $A$, and $\widehat{x}$ is the
computed vector for the solution of $Ax = b$ through $\widehat{L}$, $\widehat{U}$ and forward and backward
substitution. Then

$$\begin{aligned}
A + E' &= \widehat{L}\widehat{U}, & |E'| &\le \gamma_n\, |\widehat{L}|\, |\widehat{U}|, \\
(A + E)\,\widehat{x} &= b, & |E| &\le \gamma_n (3 + \gamma_n)\, |\widehat{L}|\, |\widehat{U}|,
\end{aligned}$$

where $|B|$ is the matrix with entries $|B|_{ij} = |b_{ij}|$ being the absolute values of the
entries of $B$.


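Proof Algorithm 10 for the LU factorization is equivalent to the computations
(including roundoff errors):

$$\begin{aligned}
\widehat{u}_{ij} &\leftarrow a_{ij} - \sum_{k=1}^{i-1} \widehat{\ell}_{ik} \widehat{u}_{kj}, & j &\ge i, \\
\widehat{\ell}_{ij} &\leftarrow \Bigl( a_{ij} - \sum_{k=1}^{j-1} \widehat{\ell}_{ik} \widehat{u}_{kj} \Bigr) / \widehat{u}_{jj}, & j &< i,
\end{aligned}$$

with evaluation of the sums in the natural order. Recall that $\widehat{\ell}_{ii} = \ell_{ii} = 1$. From the
previous lemma,

$$\begin{aligned}
\Bigl| a_{ij} - \sum_{k=1}^{i-1} \widehat{\ell}_{ik} \widehat{u}_{kj} - \widehat{\ell}_{ii} \widehat{u}_{ij} \Bigr| &\le \gamma_i \sum_{k=1}^{i} |\widehat{\ell}_{ik}|\, |\widehat{u}_{kj}|, & j &\ge i, \\
\Bigl| a_{ij} - \sum_{k=1}^{j} \widehat{\ell}_{ik} \widehat{u}_{kj} \Bigr| &\le \gamma_j \sum_{k=1}^{j} |\widehat{\ell}_{ik}|\, |\widehat{u}_{kj}|, & j &< i.
\end{aligned}$$

Then

$$|A - \widehat{L}\widehat{U}| \le \gamma_n\, |\widehat{L}|\, |\widehat{U}|,$$

or equivalently

$$A + E' = \widehat{L}\widehat{U}, \qquad |E'| \le \gamma_n\, |\widehat{L}|\, |\widehat{U}|.$$

Using forward and backward substitution, we solve $\widehat{L} y = b$ (with result $\widehat{y}$) and
$\widehat{U} x = \widehat{y}$ (with result $\widehat{x}$). These equations are equivalent to

$$y_i + \sum_{j=1}^{i-1} \widehat{\ell}_{ij} y_j = b_i, \quad\text{and}\quad
\widehat{u}_{ii} x_i + \sum_{j=i+1}^{n} \widehat{u}_{ij} x_j = \widehat{y}_i.$$

Following the previous lemma, we have

$$\begin{aligned}
\Bigl| b_i - \sum_{j=1}^{i} \widehat{\ell}_{ij} \widehat{y}_j \Bigr| &\le \gamma_n \sum_{j=1}^{i} |\widehat{\ell}_{ij}|\, |\widehat{y}_j| && \text{for all } i, \\
\Bigl| \widehat{y}_i - \sum_{j=i}^{n} \widehat{u}_{ij} \widehat{x}_j \Bigr| &\le \gamma_n \sum_{j=i}^{n} |\widehat{u}_{ij}|\, |\widehat{x}_j| && \text{for all } i.
\end{aligned}$$

Then, we have $\delta\ell_{ij}$ and $\delta u_{ij}$ such that

$$\begin{aligned}
b_i &= \sum_{j=1}^{i} (\widehat{\ell}_{ij} + \delta\ell_{ij})\, \widehat{y}_j, & |\delta\ell_{ij}| &\le \gamma_n\, |\widehat{\ell}_{ij}|, \\
\widehat{y}_i &= \sum_{j=i}^{n} (\widehat{u}_{ij} + \delta u_{ij})\, \widehat{x}_j, & |\delta u_{ij}| &\le \gamma_n\, |\widehat{u}_{ij}|,
\end{aligned}$$

for all $i$ and $j$. Expressed in terms of the matrices $\widehat{L}$ and $\widehat{U}$: $(\widehat{L} + \delta L)\, \widehat{y} = b$ and
$(\widehat{U} + \delta U)\, \widehat{x} = \widehat{y}$ with $|\delta L| \le \gamma_n |\widehat{L}|$ and $|\delta U| \le \gamma_n |\widehat{U}|$; so

$$(\widehat{L} + \delta L)(\widehat{U} + \delta U)\, \widehat{x} = b.$$

Thus, we have $(A + E)\widehat{x} = b$ with $E = (\widehat{L}\widehat{U} - A) + \delta L\, \widehat{U} + \widehat{L}\, \delta U + \delta L\, \delta U$. We
have already shown that $|\widehat{L}\widehat{U} - A| \le \gamma_n |\widehat{L}|\, |\widehat{U}|$, $|\delta L| \le \gamma_n |\widehat{L}|$ and $|\delta U| \le \gamma_n |\widehat{U}|$.
Therefore

$$|E| \le 3\gamma_n\, |\widehat{L}|\, |\widehat{U}| + \gamma_n^2\, |\widehat{L}|\, |\widehat{U}| = \gamma_n (3 + \gamma_n)\, |\widehat{L}|\, |\widehat{U}|,$$

as required.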
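
The first bound of Theorem 2.5 can be checked entry by entry, because the exact product of the computed factors is a rational matrix. The Python sketch below is an illustration only: the 4 × 4 test matrix is an arbitrary diagonally dominant choice of ours, `lu` is a direct transcription of Algorithm 10 with the multipliers kept in a separate matrix, and $u = 2^{-53}$ is assumed for round-to-nearest double precision.

```python
from fractions import Fraction


def lu(A):
    """LU factorization without pivoting in floating point; returns (L, U)."""
    n = len(A)
    U = [row[:] for row in A]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            U[i][k] = 0.0                       # eliminated entry, not computed
            for j in range(k + 1, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U


A = [[4.0, 1.0, 0.3, 0.1],
     [1.0, 5.0, 0.7, 0.2],
     [0.3, 0.7, 6.0, 1.1],
     [0.1, 0.2, 1.1, 7.0]]
L, U = lu(A)
n = len(A)
u = Fraction(1, 2**53)
gamma_n = n * u / (1 - n * u)

within_bound = True
for i in range(n):
    for j in range(n):
        exact = sum(Fraction(L[i][k]) * Fraction(U[k][j]) for k in range(n))
        bound = gamma_n * sum(abs(Fraction(L[i][k])) * abs(Fraction(U[k][j])) for k in range(n))
        within_bound = within_bound and abs(exact - Fraction(A[i][j])) <= bound
print(within_bound)
```

The check is componentwise, matching the statement $|E'| \le \gamma_n |\widehat{L}|\, |\widehat{U}|$ rather than a norm bound.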


2.1.4 Pivoting and PA = LU 


Wilkinson's backward error analysis showed the importance of keeping the size of
the factor matrices from growing excessively large. However, LU factorization does
not in itself guarantee that. The weakness comes from the division on line 5 of
Algorithm 10: $m_{ik} \leftarrow a_{ik}/a_{kk}$. If $a_{kk} = 0$ then the algorithm fails. If $a_{kk} \approx 0$ then
the algorithm can complete, but we may have very large entries in $L$, and therefore
also in $U$. This can cause numerical problems.

The LU factorization will fail for

$$A_0 = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}$$

since for $k = 1$ we have $a_{11} = 0$, and we get division by zero. But this matrix is
invertible. Its inverse is $\begin{bmatrix} -1 & 1 \\ 1 & 0 \end{bmatrix}$. The condition number $\kappa_\infty(A_0) = \|A_0\|_\infty\, \|A_0^{-1}\|_\infty =
2 \times 2 = 4$ which is not large. We could replace this zero entry with something small,
call it $\epsilon \neq 0$. The perturbation theorem for linear systems (Theorem 2.1) indicates
this would only result in an error of size $\approx 2\epsilon$ using the $\infty$-norm. Let us take $\epsilon$ to be
a power of two, but smaller than unit roundoff. Taking $\epsilon$ to be a power of two ensures
that no roundoff error will occur in multiplication or division with $\epsilon$.

Then computing the LU factorization of

$$A_\epsilon = \begin{bmatrix} \epsilon & 1 \\ 1 & 1 \end{bmatrix}$$

in exact arithmetic will result in

$$L = \begin{bmatrix} 1 & \\ 1/\epsilon & 1 \end{bmatrix}, \qquad U = \begin{bmatrix} \epsilon & 1 \\ & 1 - 1/\epsilon \end{bmatrix}.$$

Unfortunately, since $\epsilon < u$ (unit roundoff), the computed value of $1 - (1/\epsilon)$ is actually
$-1/\epsilon$ as shifting the "1" to align the exponents results in the mantissa being
completely shifted off. So the computed $L$ and $U$ are

$$\widehat{L} = \begin{bmatrix} 1 & \\ 1/\epsilon & 1 \end{bmatrix}, \qquad \widehat{U} = \begin{bmatrix} \epsilon & 1 \\ & -1/\epsilon \end{bmatrix}.$$

These are not close to a factorization of $A_\epsilon$, or $A_0$: $\widehat{L}\widehat{U} = \begin{bmatrix} \epsilon & 1 \\ 1 & 0 \end{bmatrix}$. Why does this
occur? Wilkinson's backward error analysis (Theorem 2.5) shows that the error in
the reconstruction of $A_\epsilon$ is bounded by approximately $nu\, (|A| + |\widehat{L}|\, |\widehat{U}|)$ with $n = 2$.
But, in this case,

$$|\widehat{L}|\, |\widehat{U}| = \begin{bmatrix} \epsilon & 1 \\ 1 & 2/\epsilon \end{bmatrix},$$

and the (2, 2) entry is very large.
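
This breakdown is easy to reproduce. The short Python sketch below carries out the 2 × 2 factorization by hand in double precision, with $\epsilon = 2^{-60}$ as one possible power of two below the unit roundoff (our choice; any sufficiently small power of two behaves the same way).

```python
eps = 2.0 ** -60             # a power of two below the unit roundoff

# LU factorization of [[eps, 1], [1, 1]] without pivoting
l21 = 1.0 / eps              # exact, since eps is a power of two
u11, u12 = eps, 1.0
u22 = 1.0 - l21 * u12        # exact value is 1 - 1/eps; the "1" is lost in rounding

# product of the computed factors [[1, 0], [l21, 1]] and [[u11, u12], [0, u22]]
product = [[u11, u12], [l21 * u11, l21 * u12 + u22]]

print(u22 == -l21)           # the computed u22 equals -1/eps
print(product)               # the (2, 2) entry of the original matrix was 1
```

The product of the computed factors reproduces three entries of $A_\epsilon$ exactly, but its (2, 2) entry no longer carries any information about the original entry.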

We need to avoid these large values by avoiding division by (relatively) small
pivot entries. We can do this by using row swaps, even if $a_{kk} \neq 0$. The result is not
quite an LU factorization. Instead we create a permuted LU factorization:

$$PA = LU, \tag{2.1.5}$$

where $P$ is a permutation matrix.


2.1.4.1 Permutation Matrices

A permutation matrix is the identity matrix with the rows shuffled. As a result, every
entry is either zero or one; every column has exactly one entry that is one; and every
row has exactly one entry that is one. Examples include the following:

$$\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \qquad
\begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad
\begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$

Permutation matrices can be represented efficiently in memory using simple arrays
of integers. The above examples would be represented by

[2, 1], [2, 1, 3], [3, 1, 2, 4].

An array $[\pi(1), \pi(2), \ldots, \pi(n)]$ represents the permutation matrix $P$ where $P e_j =
e_{\pi(j)}$, $j = 1, 2, \ldots, n$. Note that every integer from one to $n$ is listed in the array
exactly once, so that $\pi$ is a permutation of $\{1, 2, \ldots, n\}$.

Permutation matrices $P$ shuffle the entries of a vector, so that $Px$ has the same
entries as $x$, but in a different order. This means that $\|Px\|_p = \|x\|_p$ for any
$1 \le p \le \infty$. In particular, we can take $p = 2$, so that $\|Px\|_2^2 = \|x\|_2^2$ and $x^T P^T P x =
x^T I x$ for all $x$. Thus $P^T P = I$ and $P^T = P^{-1}$. That is, permutation matrices are
orthogonal.

The vector array subscripting features of MATLAB, Julia, and R enable us to 
apply permutation matrices without forming an actual permutation matrix. Array 
comprehensions in Python can achieve the same effect. 
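
As an illustration of the last remark, the sketch below applies $P$ and $P^T$ to a vector in Python using only the integer array. The function names are hypothetical; the array `pi` is kept 1-based as in the text and converted to 0-based list indices inside the functions, and the convention is $P e_j = e_{\pi(j)}$, so that entry $j$ of $x$ moves to position $\pi(j)$.

```python
def apply_permutation(pi, x):
    """Return P x, where P e_j = e_{pi(j)} and pi is a 1-based array as in the text."""
    y = [None] * len(x)
    for j, target in enumerate(pi):
        y[target - 1] = x[j]          # entry j of x moves to position pi(j)
    return y


def apply_transpose(pi, y):
    """Return P^T y = P^{-1} y with a list comprehension, without forming P."""
    return [y[target - 1] for target in pi]


pi = [3, 1, 2, 4]                     # the 4 x 4 example above
x = [10, 20, 30, 40]
y = apply_permutation(pi, x)
print(y)                              # same entries as x, in a different order
print(apply_transpose(pi, y))         # recovers x since P^T P = I
```

Both operations cost $O(n)$ time and storage, compared with $O(n^2)$ for an explicit matrix-vector product.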


2.1.4.2 LU Factorization with Partial Pivoting 


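To see how to modify our LU factorization algorithm let's use the recursive approach:
what we will do is compute $P$, $L$, and $U$ satisfying $PA = LU$. For $n > 1$ we write

$$A = \begin{bmatrix} \alpha & r^T \\ c & \widetilde{A} \end{bmatrix}.$$

In order to ensure that (at least) the entries of $L$ are not large, at the start of each
stage (for each $k$) we find the row $i^*$ which maximizes $|a_{ik}|$ over $i \ge k$, and swap
rows $k$ and $i^*$. This strategy is called partial pivoting and ensures that after swapping
these rows, $|m_{ik}| = |a_{ik}/a_{kk}| \le 1$. That is, the entries of $L$ are never more than one in
magnitude. If, before swapping rows for stage $k$, we have $|a_{kk}| \ge |a_{ik}|$ for all $i > k$,
then no swap is carried out.

Suppose we apply our row swap algorithm for the first stage ($k = 1$) swapping
rows $i^*$ and $k = 1$:

$$P_1 A = \begin{bmatrix} \alpha' & (r')^T \\ c' & \widetilde{A}' \end{bmatrix}.$$

Then we want to factor $P_1 A$: setting $m \leftarrow c'/\alpha'$, and $B \leftarrow \widetilde{A}' - m (r')^T$ we get

$$\begin{bmatrix} \alpha' & (r')^T \\ c' & \widetilde{A}' \end{bmatrix} = \begin{bmatrix} 1 & \\ m & I \end{bmatrix} \begin{bmatrix} \alpha' & (r')^T \\ & B \end{bmatrix}.$$

We can recursively factor $\widetilde{P} B = \widetilde{L}\widetilde{U}$ with a permutation matrix $\widetilde{P}$. This gives (noting
that $\widetilde{P}^{-1} = \widetilde{P}^T$):

$$P_1 A = \begin{bmatrix} \alpha' & (r')^T \\ c' & \widetilde{A}' \end{bmatrix} = \begin{bmatrix} 1 & \\ m & I \end{bmatrix} \begin{bmatrix} \alpha' & (r')^T \\ & \widetilde{P}^T \widetilde{L}\widetilde{U} \end{bmatrix},$$

$$\begin{bmatrix} 1 & \\ & \widetilde{P} \end{bmatrix} P_1 A = \begin{bmatrix} 1 & \\ \widetilde{P} m & \widetilde{P} \end{bmatrix} \begin{bmatrix} \alpha' & (r')^T \\ & \widetilde{P}^T \widetilde{L}\widetilde{U} \end{bmatrix}
= \begin{bmatrix} 1 & \\ \widetilde{P} m & \widetilde{L} \end{bmatrix} \begin{bmatrix} \alpha' & (r')^T \\ & \widetilde{U} \end{bmatrix} = LU.$$

Putting $P = \begin{bmatrix} 1 & \\ & \widetilde{P} \end{bmatrix} P_1$ gives the permutation matrix for $PA = LU$. Note that we
need to use the permuted multipliers $\widetilde{P} m$ instead of the original $m$. This means
that any permutation of the rows that we apply, we have to apply to the previously
computed multipliers. Algorithm 11 shows a non-recursive pseudo-code for LU
factorization with partial pivoting.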
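
A compact Python sketch of LU factorization with partial pivoting is given below. It is one possible realization rather than a transcription of Algorithm 11: the name `lu_partial_pivoting` is our own, the multipliers are stored in the strictly lower triangle of the working array so that swapping whole rows also swaps the previously computed multipliers, and the permutation is returned as a 0-based array `perm` meaning that row `k` of $PA$ is row `perm[k]` of $A$.

```python
def lu_partial_pivoting(A):
    """Return (perm, LU) with P A = L U; LU holds U on/above the diagonal, L below."""
    n = len(A)
    LU = [row[:] for row in A]
    perm = list(range(n))
    for k in range(n - 1):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))   # row i* of largest |a_ik|
        if LU[p][k] == 0:
            continue                          # column already zero: nothing to eliminate
        LU[k], LU[p] = LU[p], LU[k]           # swaps stored multipliers as well
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]              # multiplier, at most one in magnitude
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]
    return perm, LU


perm, LU = lu_partial_pivoting([[0.0, 1.0], [1.0, 1.0]])
print(perm, LU)
```

For the matrix $A_0$ above, the single row swap removes the zero pivot and the factorization completes without any division by zero.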

Since part of the purpose of using partial pivoting is to keep the size of the entries
in $L$ and $U$ from becoming large, we should investigate just how large these entries
could become. It is clear from the way the algorithm is designed that the entries of
$L$ cannot have magnitude greater than one. But what about $U$? The entries of $U$ are
the entries of $A$ being overwritten. If we use a superscript $(k)$ to indicate the value
of $a_{ij}$ after stage $k$, then

$$a^{(k+1)}_{ij} \leftarrow a^{(k)}_{ij} - m_{ik}\, a^{(k)}_{kj} \quad\text{for } i, j > k.$$

Since $|m_{ik}| \le 1$ we get $|a^{(k+1)}_{ij}| \le |a^{(k)}_{ij}| + |a^{(k)}_{kj}| \le 2 \max_{p,q} |a^{(k)}_{pq}|$. So we obtain a
bound

$$\max_{i,j} |a^{(k+1)}_{ij}| \le 2 \max_{i,j} |a^{(k)}_{ij}|.$$


Algorithm 11 LU factorization with partial pivoting, overwrites A

function LUPP(A) // overwrite A
  for k = 1, 2, ..., n−1
    i* ← arg max_{i ≥ k} |a_ik|
    if a_{i*k} = 0 then skip rest of loop
    for j = k, k+1, ..., n
      swap a_{i*j} with a_kj
    end for
    for j = 1, ..., k−1
      swap l_{i*j} with l_kj // previously computed multipliers
    end for
    for i = k+1, ..., n
      l_ik ← a_ik / a_kk
      for j = k+1, ..., n // do row operations
        a_ij ← a_ij − l_ik a_kj
      end for
    end for
  end for
end function


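Consequently $\max_{p,q} |u_{pq}| \le 2^{n-1} \max_{p,q} |a_{pq}|$. Wilkinson [261] found an example
in which this actually occurs:

$$W_n = \begin{bmatrix}
1 & 0 & 0 & \cdots & 1 \\
-1 & 1 & 0 & \cdots & 1 \\
-1 & -1 & 1 & \cdots & 1 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
-1 & -1 & -1 & \cdots & 1
\end{bmatrix}.$$

In the first stage of LU factorization, the entries below the (1, 1) entry are zeroed out;
no pivoting is done since all entries below the (1, 1) entry have the same magnitude
as the (1, 1) entry. No entries in columns 2 through $n - 1$ are changed. However, the
entries below the (1, $n$) entry are doubled. This gives the matrix

$$W_n^{(1)} = \begin{bmatrix}
1 & 0 & 0 & \cdots & 1 \\
0 & 1 & 0 & \cdots & 2 \\
0 & -1 & 1 & \cdots & 2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & -1 & -1 & \cdots & 2
\end{bmatrix}.$$

Applying stage $k$ of the LU factorization we see the doubling of entry $(i, n)$ for $i > k$.
At the end of the factorization, the $(n, n)$ entry has the value $2^{n-1}$.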
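
The doubling is easy to observe numerically. The Python sketch below builds $W_n$ and eliminates with partial pivoting; the function names are hypothetical, and ties in the pivot search are resolved in favor of the first row of maximal magnitude, which is what makes the search select the diagonal entry at every stage.

```python
def wilkinson_matrix(n):
    """W_n: ones on the diagonal and in the last column, -1 below the diagonal."""
    return [[1.0 if (j == i or j == n - 1) else (-1.0 if j < i else 0.0)
             for j in range(n)] for i in range(n)]


def eliminate_with_partial_pivoting(A):
    """Return the upper triangular factor U and the number of row swaps performed."""
    n = len(A)
    U = [row[:] for row in A]
    swaps = 0
    for k in range(n - 1):
        p = max(range(k, n), key=lambda i: abs(U[i][k]))   # first row of maximal magnitude
        if p != k:
            U[k], U[p] = U[p], U[k]
            swaps += 1
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    return U, swaps


n = 8
U, swaps = eliminate_with_partial_pivoting(wilkinson_matrix(n))
print(swaps)                                  # no swaps are carried out
print([U[i][n - 1] for i in range(n)])        # last column of U: powers of two
```

All arithmetic here involves small integers and powers of two, so the growth is reproduced exactly rather than approximately.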

This exponential growth of the size of entries during the LU factorization process
is rarely seen in practice.


2.1.4.3 Complete Pivoting


Complete pivoting requires finding the pair $(i, j)$ that maximizes $|a_{ij}|$ over $i, j \ge k$,
and swapping both rows and columns. This means searching over $(n - k + 1)^2$ entries
of the matrix, making a total of $\sim \tfrac13 n^3$ comparisons just for carrying out the pivot
search. This has the same asymptotics as the number of flops needed to carry out LU
factorization. For this reason, complete pivoting is usually not used.

While partial pivoting gives a factorization $PA = LU$ with $P$ a permutation
matrix, complete pivoting involves column and row swaps which can be made
independently, leading to a factorization $P A Q^T = LU$ with both $P$ and $Q$ being
permutation matrices.

Wilkinson [261] was able to show that $\max_{p,q} |u_{pq}| / \max_{p,q} |a_{pq}|$ is bounded
above by

$$n^{1/2} \bigl( 2 \cdot 3^{1/2} \cdot 4^{1/3} \cdots n^{1/(n-1)} \bigr)^{1/2} \sim C\, n^{(1/2) + (1/4) \ln n} \quad\text{as } n \to \infty.$$

The actual asymptotic behavior of $\max_{p,q} |u_{pq}| / \max_{p,q} |a_{pq}|$ appears to be much
better than this, although the true asymptotic behavior is unknown.


2.1.4.4 Diagonally Dominant Matrices 


A (row) diagonally dominant matrix $A$ is a square matrix where

$$|a_{ii}| \ge \sum_{j : j \neq i} |a_{ij}| \quad\text{for all } i. \tag{2.1.6}$$

The matrix is called strongly diagonally dominant if the inequality in (2.1.6) is strict
for all $i$.

Strongly diagonally dominant matrices are invertible: if $D$ is the diagonal part of $A$
then strong diagonal dominance implies that $\|D^{-1}(A - D)\|_\infty < 1$. By Lemma 2.2,
$I + D^{-1}(A - D)$ is invertible. Pre-multiplying by $D$, which is invertible as $D$ has
non-zero diagonal entries, we see that $D + (A - D) = A$ is also invertible.

Partial pivoting is not necessary for diagonally dominant matrices. 


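Theorem 2.6 If $A$ is an invertible diagonally dominant matrix, then the LU
factorization without partial pivoting will succeed.


Proof We show that if $A$ is an invertible diagonally dominant matrix and


\[ A = \begin{bmatrix} \alpha & r^T \\ c & \widetilde{A} \end{bmatrix} = \begin{bmatrix} 1 & \\ c/\alpha & I \end{bmatrix} \begin{bmatrix} \alpha & r^T \\ & B \end{bmatrix}, \qquad B = \widetilde{A} - c\, r^T / \alpha, \]

then $B$ is also invertible and diagonally dominant. Note that $\det A = \alpha \det B \ne 0$ so
$\alpha \ne 0$ and $B$ is invertible. We now show that $B$ is diagonally dominant. Note that for all $i$,
$|c_i| + \sum_{j:\, j \ne i} |\widetilde{a}_{ij}| \le |\widetilde{a}_{ii}|$ and $|r_i| + \sum_{j:\, j \ne i} |r_j| \le |\alpha|$ by diagonal dominance of $A$. So

\begin{align*}
\sum_{j:\, j \ne i} |b_{ij}| = \sum_{j:\, j \ne i} \left| \widetilde{a}_{ij} - c_i r_j / \alpha \right|
 &\le \sum_{j:\, j \ne i} |\widetilde{a}_{ij}| + \left| \frac{c_i}{\alpha} \right| \sum_{j:\, j \ne i} |r_j| \\
 &\le \left( |\widetilde{a}_{ii}| - |c_i| \right) + \left| \frac{c_i}{\alpha} \right| \left( |\alpha| - |r_i| \right) \\
 &= |\widetilde{a}_{ii}| - |c_i| + |c_i| - \left| \frac{c_i}{\alpha} \right| |r_i| \\
 &= |\widetilde{a}_{ii}| - \left| \frac{c_i}{\alpha} \right| |r_i| \le \left| \widetilde{a}_{ii} - \frac{c_i}{\alpha} r_i \right| = |b_{ii}|,
\end{align*}


as we wanted.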


2.1.5 Variants of LU Factorization 


\[ A = LL^T \qquad \text{Cholesky factorization, $L$ lower triangular,} \tag{2.1.7} \]
\[ A = LDL^T \qquad \text{$L$ lower triangular, $D$ diagonal.} \tag{2.1.8} \]


The Cholesky factorization of a symmetric matrix requires the matrix $A$ be positive
definite as well:


\[ z^T A z > 0 \quad \text{for all } z \ne 0. \tag{2.1.9} \]


While the $LDL^T$ factorization does not require that $A$ is positive definite, the
factorization can be numerically unstable if $A$ is not positive definite. An improvement
is the Bunch–Kaufman (BK) factorization [38, 39] which is an $LDL^T$ factorization
where $D$ is a block-diagonal matrix with $1 \times 1$ or $2 \times 2$ diagonal blocks along with
symmetric row and column swaps. The BK factorization is intended for situations
where $A$ is symmetric, but not positive definite.

There are variants of LU factorization that use block matrices for performance 
improvements. These have no effect on the total number of floating point operations, 
but do change the way memory is accessed. 


2.1.5.1 Positive-Definite Matrices 


Positive-definite matrices arise in a number of contexts: in statistics where, for exam- 
ple, variance—covariance matrices are positive definite; in optimization where convex 
functions can be identified by having positive-definite Hessian matrices; in physics, 
quadratic energy functions are generated by positive-definite matrices; in partial 
differential equations, where the discretization of elliptic equations leads to positive- 
definite matrices. 

We take (2.1.9) as the definition that $A$ is positive definite. Many authors assume
that when a matrix is described as positive definite, it must also be symmetric. Here we
do not.¹ If a matrix is both positive definite and symmetric, we will say so explicitly.
We do assume, unless otherwise stated, that a matrix is real. For a complex matrix
$A$ we modify definition (2.1.9) to


\[ \operatorname{Re}\, \bar{z}^T A z > 0 \quad \text{for all complex } z \ne 0. \tag{2.1.10} \]


Here $\bar{z}$ is the vector $z$ with the entries of $z$ replaced by their complex conjugates.
Note that the condition “$\bar{z}^T A z > 0$” implies that $\bar{z}^T A z$ is real. In the complex case,
instead of asking for $A$ to be symmetric ($A^T = A$), we ask for $A$ to be Hermitian:


\[ \bar{A}^T = A; \quad \text{that is, } a_{k\ell} = \bar{a}_{\ell k} \text{ for all } k, \ell. \tag{2.1.11} \]


A real matrix $A$ is positive definite if and only if its symmetric part $\tfrac{1}{2}(A + A^T)$ is
positive definite. For complex matrices, $A$ is positive definite if and only if
$\tfrac{1}{2}(A + \bar{A}^T)$ is. As an example of a positive-definite matrix that is not symmetric, consider


\[ \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}. \]

From definition (2.1.9), positive-definite matrices have a number of important
properties: positive-definite matrices are invertible; if $A$ and $B$ are positive-definite
matrices, so are $A + B$, $\alpha A$ for $\alpha > 0$, and $A^{-1}$. Symmetric positive-definite
matrices can also be understood in terms of eigenvalues: all the eigenvalues of a symmetric
matrix are real, but the eigenvalues of a symmetric positive-definite matrix are
positive: if $Av = \lambda v$ and $v \ne 0$ then $v^T A v = \lambda v^T v > 0$ so $\lambda > 0$. In fact, a symmetric
matrix is positive definite if and only if all its eigenvalues are positive.

Clearly the $n \times n$ identity matrix is positive definite. Also if $A$ is positive definite,
and $X$ is invertible, then $X^T A X$ is also positive definite. The diagonal entries of a
positive-definite matrix are positive.

In spite of the ubiquity of positive-definite matrices, actually proving that a given
matrix is positive definite (or not) can be a challenge. Fortunately, Sylvester’s criterion
gives a fairly easy way to tell (apart from looking at all the eigenvalues).


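Theorem 2.7 (Sylvester’s criterion) If $A$ is an $n \times n$ symmetric real or complex
Hermitian matrix, then it is positive definite if and only if


\[ \det \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{k1} & a_{k2} & \cdots & a_{kk} \end{bmatrix} > 0 \quad \text{for } k = 1, 2, \ldots, n. \tag{2.1.12} \]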
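
Sylvester’s criterion translates directly into a test on the leading principal minors. The sketch below is illustrative only: it assumes a real symmetric matrix stored as a list of lists, and the names `det` and `is_positive_definite_sylvester` are hypothetical. Each minor is computed independently by Gaussian elimination, which costs $O(n^4)$ operations overall and is sensitive to roundoff for nearly singular matrices, so this is a demonstration rather than a recommended numerical test.

```python
def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    n = len(M)
    A = [row[:] for row in M]
    d = 1.0
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        if A[p][k] == 0.0:
            return 0.0                      # singular matrix
        if p != k:
            A[k], A[p] = A[p], A[k]
            d = -d                          # a row swap flips the sign
        d *= A[k][k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= m * A[k][j]
    return d


def is_positive_definite_sylvester(A):
    """Check (2.1.12): every leading principal minor of symmetric A is positive."""
    n = len(A)
    return all(det([row[:k] for row in A[:k]]) > 0.0
               for k in range(1, n + 1))
```

For the tridiagonal matrix with 2 on the diagonal and $-1$ next to it, the leading minors are $2, 3, 4, \ldots$, so the test succeeds; for $\begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}$ the second minor is $-3$ and the test fails.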


¹ Stoer and Bulirsch [241, pp. 180-181] assume “positive definite” implies “symmetric” while
Golub and van Loan [105, pp. 14-142] explicitly allow non-symmetric positive-definite matrices.

Related to positive-definite matrices are (real) positive semi-definite matrices:

\[ z^T A z \ge 0 \quad \text{for all } z. \tag{2.1.13} \]


Sums of positive semi-definite matrices $A + B$ are positive semi-definite, and if in
addition, either $A$ or $B$ is positive definite, then $A + B$ is positive definite. For any
$X$ (even rectangular), if $A$ is positive semi-definite then so is $X^T A X$ provided the
matrix product is well defined. A symmetric positive semi-definite matrix $A$ that is
also invertible is positive definite. Symmetry is important here:
$\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}$ is both
positive semi-definite and invertible, but it is not positive definite.
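Symmetric matrices $B$ have real eigenvalues and an orthonormal basis of
eigenvectors. This is equivalent to the existence of an orthogonal matrix $Q$ where

\[ Q^T B Q = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n). \]

The columns of $Q$ are eigenvectors of $B$. If $B$ is symmetric and positive definite,
then all these eigenvalues are positive: $\lambda_i = q_i^T B q_i > 0$ where $q_i$ is the $i$th column
of $Q$. Also, $\|B\|_2 = \max_i \lambda_i$. To see this, any vector $x$ can be written $x = \sum_{i=1}^n c_i q_i$
so for symmetric positive definite $B$,


\[ \|B\|_2 = \max_{x \ne 0} \frac{\|Bx\|_2}{\|x\|_2} = \max_{c \ne 0} \frac{\left( \sum_{i=1}^n \lambda_i^2 c_i^2 \right)^{1/2}}{\left( \sum_{i=1}^n c_i^2 \right)^{1/2}} = \max_i \lambda_i, \quad \text{and} \tag{2.1.14} \]
\[ \max_{x \ne 0} \frac{x^T B x}{x^T x} = \max_{c \ne 0} \frac{\sum_{i=1}^n \lambda_i c_i^2}{\sum_{i=1}^n c_i^2} = \max_i \lambda_i. \tag{2.1.15} \]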


2.1.5.2 Cholesky Factorization


The Cholesky factorization of a symmetric matrix $A$ has the form

\[ A = LL^T, \quad L \text{ lower triangular.} \]


If $A$ has a Cholesky factorization, then $A$ is symmetric and positive semi-definite:


\[ z^T A z = z^T L L^T z = (L^T z)^T (L^T z) = \|L^T z\|_2^2 \ge 0. \]

Algorithm 12 Cholesky factorization; overwrites A
    function cholesky(A)
        for k = 1, 2, ..., n
            ℓ_kk ← sqrt(a_kk)
            for j = k+1, ..., n
                ℓ_jk ← a_jk / ℓ_kk
                for i = k+1, ..., j
                    a_ji ← a_ji − ℓ_jk ℓ_ik   // update the lower triangle, which is the part read above
                    a_ij ← a_ji               // optional: keep A symmetric
                end for
            end for
        end for
        return L
    end function


But if A is invertible and has a Cholesky factorization, then A must be positive 
definite. Our algorithm for the Cholesky factorization of a symmetric matrix will 
assume that A is positive definite, rather than just positive semi-definite. 


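We can recursively construct a Cholesky factorization of a symmetric
positive-definite matrix as follows. Let
$A = \begin{bmatrix} \alpha & a^T \\ a & \widetilde{A} \end{bmatrix}$
be symmetric and positive definite. The
base case for the recursion is where $A$ is a $1 \times 1$ matrix: $A = [\alpha] = [\lambda]\,[\lambda]$ where
$\lambda = \sqrt{\alpha}$. If $A$ is $n \times n$ with $n > 1$, we note that $\alpha = e_1^T A e_1 > 0$. Let $\lambda = \sqrt{\alpha}$, and
$\ell = a / \lambda$. Then


\[ \begin{bmatrix} \alpha & a^T \\ a & \widetilde{A} \end{bmatrix} = \begin{bmatrix} \lambda & \\ \ell & I \end{bmatrix} \begin{bmatrix} 1 & \\ & \widetilde{A} - \ell \ell^T \end{bmatrix} \begin{bmatrix} \lambda & \ell^T \\ & I \end{bmatrix}. \]

The matrix in the middle is also positive definite, since


\[ \begin{bmatrix} 1 & \\ & \widetilde{A} - \ell \ell^T \end{bmatrix} = \begin{bmatrix} \lambda & \\ \ell & I \end{bmatrix}^{-1} \begin{bmatrix} \alpha & a^T \\ a & \widetilde{A} \end{bmatrix} \begin{bmatrix} \lambda & \\ \ell & I \end{bmatrix}^{-T}. \]


The inverse of $\begin{bmatrix} \lambda & \\ \ell & I \end{bmatrix}$ exists as its determinant is $\lambda > 0$. Thus $\widetilde{A} - \ell \ell^T$ is also
positive definite. It is symmetric as $\widetilde{A}$ and $\ell \ell^T$ are both symmetric. Thus, we can
recursively compute a Cholesky factorization $\widetilde{A} - \ell \ell^T = \widetilde{L} \widetilde{L}^T$. That is,


\[ A = \begin{bmatrix} \lambda & \\ \ell & \widetilde{L} \end{bmatrix} \begin{bmatrix} \lambda & \ell^T \\ & \widetilde{L}^T \end{bmatrix} = LL^T. \]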


A non-recursive pseudo-code for Cholesky factorization is given in Algorithm 12.


The number of floating point operations for Cholesky factorization is
$\sim \frac{1}{3} n^3$ as $n \to \infty$, about half the number needed for LU factorization of $A$.
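
As an illustration, the column-by-column Cholesky factorization can be written in a few lines of Python. This is a minimal sketch under stated assumptions: the matrix is a symmetric positive-definite list of lists, only its lower triangle is read, and the function name `cholesky` and the choice to return a separate `L` (rather than overwrite `A`) are illustrative decisions, not the book's code.

```python
import math


def cholesky(A):
    """Return lower triangular L with A = L L^T for symmetric positive-definite A."""
    n = len(A)
    S = [row[:] for row in A]              # working copy; only the lower triangle is used
    L = [[0.0] * n for _ in range(n)]
    for k in range(n):
        if S[k][k] <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[k][k] = math.sqrt(S[k][k])
        for j in range(k + 1, n):
            L[j][k] = S[j][k] / L[k][k]    # column k of L
        for j in range(k + 1, n):
            for i in range(k + 1, j + 1):
                # rank-1 update of the remaining lower triangle
                S[j][i] -= L[j][k] * L[i][k]
    return L
```

No pivot search appears anywhere in the loop; a non-positive diagonal entry in the working copy signals that the input was not positive definite.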

While halving the number of operations may seem like a major benefit for this
method, in practice, memory movement techniques can be much more important for
performance. Cholesky factorization, however, is able to take advantage of many of
these techniques. One of the benefits of the Cholesky factorization is that pivoting is not
necessary for numerical stability. In particular, if $A = LL^T$ then


\[ a_{ii} = e_i^T A e_i = e_i^T L L^T e_i = \|L^T e_i\|_2^2 = \sum_{j=1}^{i} \ell_{ij}^2, \]


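so the entries of the $i$th row of $L$ are bounded by $\sqrt{a_{ii}}$. Thus the entries of $L$ can
never exceed $\|A\|_2^{1/2}$. Because no pivoting is necessary, the additional costs of pivot
searches and swapping rows and/or columns are avoided. Block and parallel
algorithms can exploit this certainty. Consider the recursive block version: given


\[ A = \begin{bmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} L_{11}^T & L_{21}^T \\ & L_{22}^T \end{bmatrix}, \]

so $A_{11} = L_{11} L_{11}^T$ (Cholesky factorization of a block). Then $A_{21} = L_{21} L_{11}^T$ so
$L_{21} = A_{21} L_{11}^{-T}$, and we have $A_{22} - L_{21} L_{21}^T = L_{22} L_{22}^T$ (recursive Cholesky factorization).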
By choosing $A_{11}$ to be the top-left $b \times b$ submatrix of $A$, we can perform a $b \times b$
Cholesky factorization using the standard algorithm, and then use matrix–matrix
operations. The matrix $L_{11}^{-T}$ can be explicitly computed in $O(b^3)$ flops. We can
thereby leverage the power of BLAS-3 operations (see Section 1.1.7).

By comparison, LU factorization with partial pivoting must work with the entire
block column $\begin{bmatrix} A_{11} \\ A_{21} \end{bmatrix}$ rather than just $A_{11}$, to compute $L_{11}$ and $U_{11}$. This means
that a much larger amount of data has to be processed, which might not fit in cache
memory.


2.1.5.3 $LDL^T$ Factorization and BK Factorization


Both $LDL^T$ and BK (Bunch–Kaufman) [39] factorizations involve having a
diagonal or block-diagonal matrix $D$, and $L$ matrices that have ones on the diagonal.
The $LDL^T$ factorization of a symmetric positive-definite matrix is equivalent to the
Cholesky factorization, as then the diagonal matrix $D$ must have positive diagonal
entries, and


\[ A = LDL^T = (L D^{1/2})(L D^{1/2})^T, \]
\[ (D^{1/2})_{kk} = \sqrt{d_{kk}} \quad \text{and} \quad (D^{1/2})_{k\ell} = 0 \text{ if } k \ne \ell. \]


If $A$ is positive definite, the advantage of the $LDL^T$ factorization is avoiding
computing square roots. Square roots take roughly 10-20 times as long to compute as
addition, subtraction, or multiplication in modern architectures, so this could improve
performance. On the other hand, there are only $n$ square root computations in
computing the Cholesky factorization of an $n \times n$ matrix compared with $\sim \frac{1}{3} n^3$ other
floating point operations. The cost of these $n$ square roots is small compared to the
other floating point operations in Cholesky factorization for $n > 10$.

To see how the $LDL^T$ factorization works, consider the recursive decomposition


\[ \begin{bmatrix} \alpha & a^T \\ a & \widetilde{A} \end{bmatrix} = \begin{bmatrix} 1 & \\ \ell & \widetilde{L} \end{bmatrix} \begin{bmatrix} \delta & \\ & \widetilde{D} \end{bmatrix} \begin{bmatrix} 1 & \ell^T \\ & \widetilde{L}^T \end{bmatrix}. \]


From this, we have the equations


\[ \delta = \alpha, \qquad \delta \ell = a, \text{ so } \ell = a / \delta = a / \alpha, \]
\[ \widetilde{A} - \alpha\, \ell \ell^T = \widetilde{L} \widetilde{D} \widetilde{L}^T \quad \text{(recursive $LDL^T$ factorization).} \]
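
The recursion above unrolls into a loop in the same way as for the Cholesky factorization. The following is a minimal sketch, assuming a symmetric list-of-lists matrix and no pivoting; the function name `ldlt` and the representation of $D$ by the list `d` of its diagonal entries are illustrative choices. A zero pivot is reported rather than handled, since handling it requires the symmetric pivoting and $2 \times 2$ blocks of the BK factorization.

```python
def ldlt(A):
    """Return (L, d) with A = L diag(d) L^T, L unit lower triangular; no pivoting."""
    n = len(A)
    S = [row[:] for row in A]              # working copy; lower triangle is used
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for k in range(n):
        d[k] = S[k][k]                     # delta = alpha
        if d[k] == 0.0:
            raise ZeroDivisionError("zero pivot: symmetric pivoting is needed")
        for j in range(k + 1, n):
            L[j][k] = S[j][k] / d[k]       # ell = a / delta
        for j in range(k + 1, n):
            for i in range(k + 1, j + 1):
                # remaining block: A~ - delta * ell ell^T (lower triangle)
                S[j][i] -= d[k] * L[j][k] * L[i][k]
    return L, d
```

Note that no square roots are taken, and that the routine runs without error on some indefinite matrices such as $\begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}$, where it returns $d = (1, -3)$.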


This algorithm can work for some matrices that are not positive definite or positive 
semi-definite, such as 


Indeed, this approach will fail for any symmetric matrix where there is a zero subma- 
trix in the top-left corner. To handle this, we need to incorporate symmetric row and 
column pivoting. We also need to allow 2 x 2 diagonal blocks. This is the approach 
of the Bunch—Kaufman (BK) factorization. 

The BK factorization [39] of a matrix $A$ is


\[ P A P^T = LDL^T, \]


where $P$ is a permutation matrix, $L$ is a lower triangular matrix with ones on the
diagonal, and $D$ is a block-diagonal matrix with diagonal blocks that are either $1 \times 1$
or $2 \times 2$. The BK algorithm is essentially a block $LDL^T$ factorization combined with
a special symmetric pivoting strategy. The BK pivoting strategy at stage $k$ involves
scanning columns $k$ and $k + 1$ and checking the diagonal entries as well. There is still
the possibility of exponential growth in the entries of the factored matrices, although
like the possible exponential growth in entries for partial pivoting, it is rare.

Algorithm 13 Block LU factorization
    function blockLU(A)
        for k = 1, 2, ..., p
            A_kk = L_kk U_kk                      // LU fact'n
            for i = k+1, ..., p
                L_ik ← A_ik U_kk^{-1}             // forward subst'n
                U_ki ← L_kk^{-1} A_ki             // forward subst'n
            end for
            // update only after all of block row k of U is available
            for i = k+1, ..., p
                for j = k+1, ..., p
                    A_ij ← A_ij − L_ik U_kj
                end for
            end for
        end for
        return ([L_ij], [U_ij])
    end function


2.1.5.4 Block Factorizations 


Block factorizations can exploit the advantages of matrix–matrix operations on
modern computer architectures. Computing the matrix–matrix product $AB$ where $A$ and
$B$ are $n \times n$ matrices using the standard algorithm takes $O(n^3)$ floating point
operations, but involves only $O(n^2)$ data transfers. If both $A$ and $B$ and the product can
fit in cache simultaneously, the number of computations per data transfer is $O(n)$.
Since in modern architectures, the time needed for transferring just one floating point
number can be used for many arithmetic operations, matrix–matrix multiplications
can be performed very efficiently.

In comparison, computing matrix–vector products $Ax$ takes $O(n^2)$ floating point
operations and $O(n^2)$ data transfers. This means that no matter how large $n$ is, data
transfers will take a significant amount of time. With the high cost of data transfers,
most of the time spent in computing $Ax$ will be in transferring data, rather than
performing numerical operations.

LAPACK [5] makes use of block matrix—matrix operations to increase efficiency. 

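To see how to use them, consider computing the LU factorization of a block matrix
$A$ without pivoting:


\[ A = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1p} \\ A_{21} & A_{22} & \cdots & A_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ A_{p1} & A_{p2} & \cdots & A_{pp} \end{bmatrix} = \begin{bmatrix} L_{11} & & & \\ L_{21} & L_{22} & & \\ \vdots & \vdots & \ddots & \\ L_{p1} & L_{p2} & \cdots & L_{pp} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} & \cdots & U_{1p} \\ & U_{22} & \cdots & U_{2p} \\ & & \ddots & \vdots \\ & & & U_{pp} \end{bmatrix} = LU. \]


Each block matrix $A_{ij}$ is $b \times b$ or smaller; the number of block rows and columns
is $p = \lceil n/b \rceil$. We choose $b$ so that two $b \times b$ matrices can be kept in either the
level-one or level-two cache. The block LU factorization algorithm is then given by
Algorithm 13.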

If $b$ is too large for two $b \times b$ matrices to fit in cache, then each block operation
will overflow the cache. This results in cache misses, and data transfer from main
memory or higher level cache. The additional data transfers take substantial time.
Keeping $b$ as large as possible but avoiding cache overflow gives an optimal blocking
version.

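For LU factorization with pivoting, we start with the LU factorization with pivoting
for the first block column:


\[ P \begin{bmatrix} A_{11} \\ A_{21} \\ \vdots \\ A_{p1} \end{bmatrix} = \begin{bmatrix} L_{11} \\ L_{21} \\ \vdots \\ L_{p1} \end{bmatrix} U_{11}. \]

Then, after applying the same row swaps to the remaining block columns, we can
perform the block operations $U_{1j} \leftarrow L_{11}^{-1} A_{1j}$ for $j = 2, \ldots, p$ and
$A_{ij} \leftarrow A_{ij} - L_{i1} U_{1j}$ for $i, j = 2, \ldots, p$. Then we recursively apply the method to
$[\, A_{ij} \mid i, j = 2, \ldots, p \,]$.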

For block QR factorization of A, like the LU factorization with pivoting, we apply 
the QR factorization to the first block column of A. An efficient way of storing the 
Q matrix is the WY form [105, Sec. 5.1.7, pp. 213-215]. 

There are also block-sparse matrices which consist of blocks $A_{ij}$ for $(i, j)$ in a
sparse set. Block-sparse matrices spread the cost of navigating the sparse matrix
data structure across more floating point operations. This also reduces the memory
overhead compared to ordinary sparse matrices.


2.1.5.5 Sherman–Morrison–Woodbury Formula


Although the Sherman–Morrison–Woodbury formula [116] is not a factorization, it
can be very helpful in solving systems of linear equations, especially for solving a
sequence of linear equations where the matrix changes by a low-rank matrix. The idea
is that if we know how to solve $Ax = b$ for $x$ given any $b$, we can efficiently solve
$(A + uv^T) z = y$ for $z$ given any $y$. We say that $A + uv^T$ is a rank-1 modification
of $A$ since $uv^T$ is a rank-1 matrix.

The basic Sherman–Morrison formula is


\[ (A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}, \tag{2.1.16} \]


provided $v^T A^{-1} u \ne -1$.

To see why this is so, suppose that $(A + uv^T)^{-1} = A^{-1} + rs^T$. Then

\[ (A + uv^T)(A^{-1} + rs^T) = I. \]

Expanding gives


\[ A A^{-1} + A r s^T + u v^T A^{-1} + u v^T r s^T = I. \]

Since $A A^{-1} = I$, subtracting this from both sides gives

\[ (Ar + u v^T r) s^T = -u (v^T A^{-1}). \]


Put $s^T = -v^T A^{-1}$. Then we have to solve $Ar = u - u v^T r = (1 - v^T r) u$. So
$r = (1 - v^T r) A^{-1} u$. Let $\gamma = 1 - v^T r$. Then $r = \gamma A^{-1} u$ and $\gamma = 1 - v^T \gamma A^{-1} u$;
solving for $\gamma$ gives $\gamma = 1 / (1 + v^T A^{-1} u)$ provided the denominator is not zero.
Combining the results for $r$ and $s$ gives


\[ (A + uv^T)^{-1} = A^{-1} + rs^T = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}, \]


as we wanted.

To use this for solving $(A + uv^T) z = y$ efficiently, we can first compute
$w = A^{-1} u$, $s^T = -v^T A^{-1}$ and $r = w / (1 + v^T w)$ provided the denominator is not zero.
Assuming we have the LU factorization of $A$, this can be computed in $O(n^2)$ flops.
The solution to our rank-1 modified equations is


\[ z = A^{-1} y + r (s^T y). \]


If we have the LU factorization of $A$, we can compute $A^{-1} y$ in $O(n^2)$ flops.
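
The procedure just described can be sketched as follows. This is an illustration under stated assumptions, not a library interface: matrices are lists of lists, and `lu_factor`, `lu_solve` and `sherman_morrison_solve` are hypothetical helper names. The factorization of $A$ is computed once; each rank-1 modified solve then costs two triangular-solve pairs, that is $O(n^2)$ flops. The code uses the equivalent form $z = A^{-1} y - w\,(v^T A^{-1} y)/(1 + v^T w)$ with $w = A^{-1} u$, so $s^T$ is never formed explicitly.

```python
def lu_factor(A):
    """LU with partial pivoting; L (unit diagonal) and U are packed in one matrix."""
    n = len(A)
    LU = [row[:] for row in A]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        if LU[p][k] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]
    return LU, perm


def lu_solve(fact, b):
    """Solve A x = b from the packed factorization in O(n^2) flops."""
    LU, perm = fact
    n = len(b)
    x = [b[perm[i]] for i in range(n)]
    for i in range(n):                     # forward substitution (unit lower)
        for j in range(i):
            x[i] -= LU[i][j] * x[j]
    for i in reversed(range(n)):           # back substitution
        for j in range(i + 1, n):
            x[i] -= LU[i][j] * x[j]
        x[i] /= LU[i][i]
    return x


def sherman_morrison_solve(fact, u, v, y):
    """Solve (A + u v^T) z = y given a factorization of A."""
    w = lu_solve(fact, u)                  # w = A^{-1} u
    t = lu_solve(fact, y)                  # t = A^{-1} y
    denom = 1.0 + sum(vi * wi for vi, wi in zip(v, w))
    if denom == 0.0:
        raise ZeroDivisionError("A + u v^T is singular")
    coeff = sum(vi * ti for vi, ti in zip(v, t)) / denom
    return [ti - coeff * wi for ti, wi in zip(t, w)]
```

A production version would also monitor the size of the denominator, as discussed next, rather than only testing for an exact zero.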

Numerical difficulties can occur if the denominator $1 + v^T A^{-1} u$ is small. This is
likely at some point in chains of rank-1 modifications $A + \sum_{j=1}^{k} u_j v_j^T$, $k = 1, 2, \ldots$,
which occur in some applications. In that case, a chain of rank-1 modifications of the
inverse can be created. But to avoid numerical instability, the values of $\|r_j\|_2$, $\|s_j\|_2$
should be monitored; if they become large, the chain should be terminated, and a
fresh LU factorization of $A + \sum_{j} u_j v_j^T$ should be computed.

A generalization to rank-$r$ modifications of a matrix is the Sherman–Morrison–Woodbury
formula:


\[ (A + UV^T)^{-1} = A^{-1} - A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1}, \tag{2.1.17} \]


provided the matrix $I + V^T A^{-1} U$ is invertible. If $U$ and $V$ are $n \times r$ matrices with
linearly independent columns, then $UV^T$ is a rank-$r$ matrix, and $A + UV^T$ is a
rank-$r$ modification of $A$.


Exercises. 


(1) Let $H_n$ be the $n \times n$ Hilbert matrix: $(H_n)_{ij} = 1/(i + j - 1)$. Given $n$, generate
    $x$ as a random vector in $\mathbb{R}^n$ and set $b \leftarrow H_n x$. Solve $H_n \hat{x} = b$
    numerically. Compute $\|x - \hat{x}\|_2 / \|x\|_2$ and $\|H_n \hat{x} - b\|_2 / \|b\|_2$. Do this for
    $n = 6, 8, 10, \ldots, 20$. Report how the relative errors change as $n$ increases. Compare
    their growth with $\kappa_2(H_n)$.

(2) An $n \times n$ matrix is strictly diagonally dominant if $|a_{ii}| > \sum_{j:\, j \ne i} |a_{ij}|$ for
    all $i$. Show that strictly row diagonally dominant matrices are invertible.
    [Hint: Write $A = D + F$ where $D$ is the diagonal part of $A$, so that $f_{ij} = a_{ij}$
    if $i \ne j$ and $f_{ii} = 0$. Then note that $A = D(I + D^{-1} F)$ and
    $\|D^{-1} F\|_\infty = \max_i \left( \sum_{j:\, j \ne i} |a_{ij}| \right) / |a_{ii}| < 1$.] Give an example of a row diagonally
    dominant matrix (but not strictly dominant) that is not invertible: $|a_{ii}| \ge \sum_{j:\, j \ne i} |a_{ij}|$
    for all $i$.
(3) The inverse $B$ of a matrix $A$ can be computed from its LU factorization: solve
    $LU b_j = e_j$, $j = 1, 2, \ldots, n$ for $b_j$, the $j$th column of $B$. Here $e_j$ is the $j$th
    standard basis vector: $(e_j)_i = 1$ if $i = j$ and zero otherwise. Determine how
    many floating point operations are needed for this computation. See if you can
    improve these counts by using the fact that $(e_j)_i = 0$ for $i < j$.
(4) Develop a block LU factorization algorithm without pivoting by starting with
    the decomposition

    \[ \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ & U_{22} \end{bmatrix}, \]

    with $A_{11}$ a $b \times b$ matrix. Use the standard LU factorization algorithm for
    factoring $A_{11} = L_{11} U_{11}$.
(5) Modify the block LU factorization of the previous question to include partial
    pivoting. Note that $A_{11}$ might not be invertible, so that swapping rows in
    $\begin{bmatrix} A_{11} \\ A_{21} \end{bmatrix}$ is necessary. Use the standard LU factorization algorithm with partial pivoting
    for $\begin{bmatrix} A_{11} \\ A_{21} \end{bmatrix}$.
(6) Tail-end recursion is where the final statement in a function is a recursive call.
    Re-write such a function as an equivalent while loop. Note that the “x” here may actually
    be a very complex object, such as a partial matrix factorization.
(7) Re-write the code below using a pair of while loops with no recursion:

        function g(x, n)
            if n = 0: return x
            else: return k(g(h(x), n − 1))
        end function

    [Hint: First try computing what the return value should be for $n = 0, 1, 2, 3$,
    and then generalize.]
(8) Using a block LU factorization of $A$,

    \[ \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} I & \\ L_{21} & I \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ & U_{22} \end{bmatrix}, \]

    show that $A$ is invertible if $A_{11}$ and the Schur complement
    $S := A_{22} - A_{21} A_{11}^{-1} A_{12}$ are both invertible.
(9) In question 8, show that $\det A = \det A_{11} \det S$.

(10) In question 8, show that if $A$ is symmetric, so is $S$. If, in addition, $A$ is positive
    definite, show that $S$ is also positive definite. [Hint: Try creating a symmetric
    “$LDL^T$” version of the block factorization in Exercise 8.]

(11) Show that for any positive semi-definite matrix $A$ there is a Cholesky
    factorization $A = LL^T$ with $L$ lower triangular. [Hint: For any $\alpha > 0$, $A + \alpha I = L_\alpha L_\alpha^T$
    by the standard Cholesky factorization. Show from the bounds on the entries of
    $L_\alpha$ that the $L_\alpha$ matrices belong to a closed and bounded subset of $\mathbb{R}^{n \times n}$;
    therefore, there is a convergent subsequence with limit $L$ satisfying $A = LL^T$.]

(12) Show that the $LDL^T$ factorization can be numerically unstable even when it
    succeeds, by considering the matrix $\begin{bmatrix} \epsilon & 1 \\ 1 & 0 \end{bmatrix}$ with $0 \ne \epsilon \approx 0$.


(13) Show that the LU factorization with partial pivoting is invariant under column 
scaling, that is, if D is diagonal and PA = LU is the LU factorization of A 
with partial pivoting, then P(AD) = L(UD) is the LU factorization of AD 
with partial pivoting. 

(14) Show that for any matrix norm and $n \times n$ matrices $A$ and $B$, the condition
    number $\kappa(AB) \le \kappa(A)\, \kappa(B)$.

(15) Show that if $D$ is diagonal and $P$ a permutation matrix, then $\widetilde{D} := P D P^T$ is
    also diagonal. Then show that if $PA = LU$ is the LU factorization of $A$ with
    partial pivoting, then $P(DA) = (\widetilde{D} L) U$ is a factorization of the row-scaled
    matrix $DA$ with row swaps. Give an example where $P(DA) = (\widetilde{D} L \widetilde{D}^{-1})(\widetilde{D} U)$
    is not the LU factorization of $DA$ with partial pivoting as some entry of $\widetilde{D} L \widetilde{D}^{-1}$
    has absolute value greater than one. (We need to have the extra factor of $\widetilde{D}^{-1}$
    so that the diagonal entries of $\widetilde{D} L \widetilde{D}^{-1}$ are one.)

(16) Sylvester’s criterion gives a way of identifying symmetric positive-definite
    matrices. However, replacing “$>$” with “$\ge$” in inequalities (2.1.12) does not
    give a proper test for symmetric positive semi-definite matrices. Show this with
    the example matrix $\begin{bmatrix} 0 & 0 \\ 0 & -1 \end{bmatrix}$.

(17) The Sherman–Morrison formula $(A + uv^T)^{-1} = A^{-1} - A^{-1} u v^T A^{-1} / (1 + v^T A^{-1} u)$
    gives a way of inverting a rank-1 update of an invertible matrix.
    Show that we can write it in compact form $(A + uv^T)^{-1} = A^{-1} + rs^T$.
    Implement this as a function that, given a factorization of $A$ (the LU factorization
    with partial pivoting, for example) and vectors $u$ and $v$, returns the pair $r$ and
    $s$. Give an operation count for your method.

(18) Extend the idea of question 17 so that given $A$, a factorization of $A$,
    and $(u_i, v_i)$, $i = 1, 2, \ldots, m$, you compute $(r_i, s_i)$, $i = 1, 2, \ldots, m$, so that
    $\left( A + \sum_{i=1}^m u_i v_i^T \right)^{-1} = A^{-1} + \sum_{i=1}^m r_i s_i^T$. Give an operation count for your
    method.


2.2 Least Squares Problems 


Least squares problems are commonly used in statistics for fitting data to a model.
For example, we might want to fit a set of data points $(x_i, y_i)$, $i = 1, 2, \ldots, n$, to a
straight line, or a quadratic $q(x) = a + bx + cx^2$: $y_i \approx q(x_i) = a + b x_i + c x_i^2$. The
“best” fit is typically taken to be the one that minimizes the sum of the squares of
the errors: $\sum_{i=1}^n e_i^2 = \sum_{i=1}^n (q(x_i) - y_i)^2$ over all possible choices of the unknown
coefficients $a$, $b$, and $c$.

The conditions for solving least squares problems can be related to orthogonal
complements: the orthogonal complement of a vector space $V \subseteq \mathbb{R}^n$ is the set


\[ V^\perp := \{\, y \mid v^T y = 0 \text{ for all } v \in V \,\}. \tag{2.2.1} \]


2.2.1 The Normal Equations 


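If we want to fit a quadratic function $q(x) = a + bx + cx^2$ to a data set $(x_i, y_i)$,
$i = 1, 2, \ldots, n$, we typically aim to minimize $\sum_{i=1}^n e_i^2$ where $e_i = q(x_i) - y_i$. Here
the unknowns are the coefficients $a$, $b$, $c$. Setting $\boldsymbol{c} = [a, b, c]^T$, we see that
$e_i = [1, x_i, x_i^2]\, \boldsymbol{c} - y_i$ and so


\[ e = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ 1 & x_3 & x_3^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} - \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} = A \boldsymbol{c} - y. \]


That is, we want to minimize


\[ \sum_{i=1}^n e_i^2 = e^T e = \|e\|_2^2 = \|A \boldsymbol{c} - y\|_2^2. \]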

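If $\boldsymbol{c}^*$ minimizes $\varphi(\boldsymbol{c}) := \|A \boldsymbol{c} - y\|_2^2$ then we have the first-order necessary condition

\[ \frac{d}{ds} \varphi(\boldsymbol{c}^* + s d) \Big|_{s=0} = 0 \quad \text{for any } d. \]


For our objective function

\begin{align*}
\frac{d}{ds} \varphi(\boldsymbol{c} + s d) &= \frac{d}{ds} \left( \|A(\boldsymbol{c} + s d) - y\|_2^2 \right) \\
 &= \frac{d}{ds} \left( (A(\boldsymbol{c} + s d) - y)^T (A(\boldsymbol{c} + s d) - y) \right) \\
 &= 2\, (A(\boldsymbol{c} + s d) - y)^T \frac{d}{ds} (A(\boldsymbol{c} + s d) - y) \\
 &= 2\, (A(\boldsymbol{c} + s d) - y)^T A d. \tag{2.2.2}
\end{align*}

In particular,

\begin{align*}
0 = \frac{d}{ds} \varphi(\boldsymbol{c}^* + s d) \Big|_{s=0} &= 2\, (A \boldsymbol{c}^* - y)^T A d = 2\, (A d)^T (A \boldsymbol{c}^* - y) \\
 &= 2\, d^T A^T (A \boldsymbol{c}^* - y) \quad \text{for all } d.
\end{align*}

That is,

\[ A^T (A \boldsymbol{c}^* - y) = 0, \quad \text{so} \]
\[ A^T A\, \boldsymbol{c}^* = A^T y. \tag{2.2.3} \]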


Equation (2.2.3) is called the normal equations. 

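The condition that $A^T (A \boldsymbol{c}^* - y) = 0$ is equivalent to requiring that $A \boldsymbol{c}^* - y$ is
in the orthogonal complement to $\operatorname{range}(A)$: $A \boldsymbol{c}^* - y \in \operatorname{range}(A)^\perp$.

These are necessary conditions, but they are also sufficient:


\begin{align*}
\frac{d^2}{ds^2} \varphi(\boldsymbol{c} + s d) &= \frac{d}{ds}\, 2\, (A(\boldsymbol{c} + s d) - y)^T A d \quad \text{by (2.2.2)} \\
 &= 2\, (A d)^T (A d) = 2\, \|A d\|_2^2 \ge 0.
\end{align*}


Using Taylor series with second-order remainder (1.6.1):


\begin{align*}
\varphi(\boldsymbol{c}^* + t d) &= \varphi(\boldsymbol{c}^*) + \frac{d}{ds} \varphi(\boldsymbol{c}^* + s d) \Big|_{s=0}\, t + \int_0^t (t - s)\, \frac{d^2}{ds^2} \varphi(\boldsymbol{c}^* + s d)\, ds \\
 &\ge \varphi(\boldsymbol{c}^*) + 0 \cdot t = \varphi(\boldsymbol{c}^*)
\end{align*}


for any $t$ or $d$. Thus, the normal equations are sufficient as well as necessary.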
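
To make the normal equations concrete, here is a minimal sketch that forms $A^T A$ and $A^T y$ for a polynomial fit and solves the resulting square system. It assumes the columns of $A$ are ordered as $1, x, x^2, \ldots$, so the returned coefficients are in order of increasing power; the helper names `solve` and `fit_polynomial` are illustrative. The square system is solved by Gaussian elimination with partial pivoting for simplicity, although a Cholesky factorization would also apply since $A^T A$ is symmetric.

```python
def solve(M, rhs):
    """Solve M x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    A = [row[:] + [b] for row, b in zip(M, rhs)]   # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        if A[p][k] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        A[k], A[p] = A[p], A[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n + 1):
                A[i][j] -= m * A[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = A[i][n] - sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / A[i][i]
    return x


def fit_polynomial(xs, ys, degree=2):
    """Least squares polynomial fit via the normal equations A^T A c = A^T y."""
    rows = [[x ** p for p in range(degree + 1)] for x in xs]   # rows of A
    m = degree + 1
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    Aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(m)]
    return solve(AtA, Aty)
```

For data lying exactly on a quadratic the residual is zero, so the fit reproduces the generating coefficients up to roundoff. Forming $A^T A$ explicitly squares the condition number of the problem, so this approach is best suited to small, well-conditioned fits.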


2.2.1.1 An Example: Fitting a Quadratic 


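As an example, suppose we wish to find a least squares quadratic fit to the data
in Table 2.2.1. This data was generated in MATLAB from the quadratic function
$q(x) = x^2 - 3x + 1$ for $0 \le x \le 2$ with pseudo-random noise added. This data leads
to the linear system


\[ \begin{bmatrix} 20.0000 & 19.4807 & 26.3727 \\ 19.4807 & 26.3727 & 40.5860 \\ 26.3727 & 40.5860 & 66.5787 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} -12.8015 \\ -19.7733 \\ -29.6592 \end{bmatrix} \quad \text{with solution} \quad \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} +1.0113 \\ -3.1477 \\ +1.0727 \end{bmatrix}. \]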

Table 2.2.1 Data for quadratic fit


x_i |  0.9569 |  0.6410 |  1.2032 |  1.8263 |  1.3650 |  1.8935 |  0.1982 |  1.0221 |  0.2203 |  1.0905
y_i | −1.0774 | −0.5767 | −1.2611 | −1.2107 | −1.3714 | −1.0651 |  0.4468 | −1.0731 |  0.3227 | −1.1482
x_i |  1.3776 |  0.2948 |  1.5551 |  0.7981 |  1.7966 |  0.6141 |  0.1221 |  0.4389 |  0.1657 |  1.9008
y_i | −1.2489 |  0.1013 | −1.2266 | −0.7618 | −1.1791 | −0.4325 |  0.6659 | −0.1414 |  0.5320 | −1.0960


[Figure: the data points $(x_i, y_i)$, the generating quadratic $q(x)$, and the fitted quadratic $\hat{q}(x)$, plotted for $0 \le x \le 2$.]

Fig. 2.2.1 Quadratic fit for data in Table 2.2.1


The solution gives the quadratic fit $\hat{q}(x)$ compared to $q(x)$ used to create the data,
as shown in Figure 2.2.1.

The matrix $A^T A$ in the normal equations has a number of important properties:


• $A^T A$ is symmetric: $(A^T A)^T = A^T A^{TT} = A^T A$.
• $A^T A$ is positive semi-definite: $z^T A^T A z = (Az)^T (Az) = \|Az\|_2^2 \ge 0$ for any $z$.


Furthermore, if the columns of $A$ are linearly independent, then for $z \ne 0$, $Az \ne 0$ as
well, and so $z^T A^T A z = \|Az\|_2^2 > 0$. That is, $A^T A$ is positive definite. Thus, provided
the columns of $A$ are linearly independent, $A^T A$ is invertible.


2.2.1.2 Perturbations, Conditioning, etc. 


To see how sensitive the least squares problem is to perturbations in the data, consider
the problem of minimizing $\|(A + E) \hat{x} - (y + z)\|_2$ over $\hat{x}$. We would like to have
some analogy to the condition number for rectangular matrices. Since least squares
problems are tied to the 2-norm, we generalize $\kappa_2(A)$ to rectangular matrices. If $A$
has linearly independent columns, then $A^T A$ is invertible, and the solution of the
least squares problem is $x = (A^T A)^{-1} A^T y$. The $n \times m$ matrix

\[ A^+ := (A^T A)^{-1} A^T \tag{2.2.4} \]


is the pseudo-inverse of $A$, provided the columns of $A$ are linearly independent. For
solving a system of linear equations $Bx = b$, the solution is given by $x = B^{-1} b$, so
the condition number of $B$ is $\kappa(B) = \|B\|\, \|B^{-1}\|$ using the appropriate matrix norm;
the solution of the least squares problem $\min_x \|Ax - b\|_2$ is given by $x = A^+ b$ so
we define the least squares condition number


\[ \kappa_2(A) = \|A\|_2\, \|A^+\|_2. \tag{2.2.5} \]

If $A$ is square with linearly independent columns, then $A$ is invertible, and

\[ A^+ = (A^T A)^{-1} A^T = A^{-1} A^{-T} A^T = A^{-1}. \]


That is, for square invertible matrices, the pseudo-inverse is the ordinary inverse,
and the least squares condition number is the ordinary condition number. The least
squares condition number also satisfies


\[ \kappa_2(A) = \|A\|_2\, \|A^+\|_2 \ge \|A^+ A\|_2 = \|(A^T A)^{-1} A^T A\|_2 = \|I\|_2 = 1. \]
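
The definition (2.2.4) can be checked on a small example. The sketch below is an illustration only: it assumes $A$ is a list of lists with linearly independent columns, the names `solve`, `pseudo_inverse` and `matmul` are hypothetical, and $A^+$ is built column by column by solving $(A^T A) x = A^T e_k$ rather than by forming $(A^T A)^{-1}$ explicitly.

```python
def solve(M, rhs):
    """Solve M x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    A = [row[:] + [b] for row, b in zip(M, rhs)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        if A[p][k] == 0.0:
            raise ZeroDivisionError("columns of A are linearly dependent")
        A[k], A[p] = A[p], A[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n + 1):
                A[i][j] -= m * A[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = A[i][n] - sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / A[i][i]
    return x


def pseudo_inverse(A):
    """Return the n x m matrix A^+ = (A^T A)^{-1} A^T for an m x n matrix A."""
    m, n = len(A), len(A[0])
    AtA = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
           for i in range(n)]
    # column k of A^+ solves (A^T A) x = A^T e_k, and A^T e_k is row k of A
    cols = [solve(AtA, list(A[k])) for k in range(m)]
    return [[cols[k][i] for k in range(m)] for i in range(n)]


def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]
```

For a tall matrix, `matmul(pseudo_inverse(A), A)` reproduces the $n \times n$ identity up to roundoff, while `matmul(A, pseudo_inverse(A))` is in general only a projection, not the $m \times m$ identity.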


Lemma 2.8 If $A$ has linearly independent columns and $\|E\|_2\, \|A^+\|_2 < 1$ then $A + E$
also has linearly independent columns.


This generalizes a result that comes out of the proof of the perturbation theorem
for linear systems (Theorem 2.1): if $A$ is invertible and $\|E\|\, \|A^{-1}\| < 1$, then $A + E$
is invertible.


Proof If $A$ has linearly independent columns, then the solution of the least squares
problem $\min_x \|Ax - b\|_2$ is $x = A^+ b$. Thus the solution of $\min_{\hat{x}} \|(A + E) \hat{x} - y\|_2$
is the solution of $\min_x \|Ax - (y - E \hat{x})\|_2$ where $x = \hat{x}$. That is, $\hat{x} = A^+ (y - E \hat{x})$,
or equivalently $(I + A^+ E) \hat{x} = A^+ y$. The matrix $I + A^+ E$ is invertible if
$\|A^+ E\|_2 \le \|A^+\|_2\, \|E\|_2 < 1$. Thus $\min_{\hat{x}} \|(A + E) \hat{x} - y\|_2$ has a unique solution under these conditions, and
therefore $A + E$ has linearly independent columns.


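Theorem 2.9 (Perturbation theorem for least squares) Suppose that


    $x$ minimizes $\|Ax - b\|_2$, and
    $\hat{x}$ minimizes $\|(A + E) x - (b + d)\|_2$,


where $A$ is $m \times n$ with linearly independent columns. If


\[ \epsilon := \max \left\{ \frac{\|E\|_2}{\|A\|_2}, \frac{\|d\|_2}{\|b\|_2} \right\} < \frac{1}{\kappa_2(A)} \quad \text{and} \quad \sin \theta := \frac{\|Ax - b\|_2}{\|b\|_2} \ne 1, \]


with $b \ne 0$, then

\[ \frac{\|\hat{x} - x\|_2}{\|x\|_2} \le \epsilon \left\{ \frac{2 \kappa_2(A)}{\cos \theta} + \tan \theta\; \kappa_2(A)^2 \right\} + O(\epsilon^2). \]

Furthermore, if $r = Ax - b$ and $\hat{r} = (A + E) \hat{x} - (b + d)$, then


\[ \frac{\|\hat{r} - r\|_2}{\|b\|_2} \le \epsilon\, (1 + 2 \kappa_2(A)) \min(1, m - n) + O(\epsilon^2). \]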
2: 


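Before we give the proof, we note that if $B$ is a symmetric positive-definite matrix,
there is a symmetric positive-definite matrix $C$ where $C^2 = B$. We write $C = B^{1/2}$.
This is because there is an orthogonal matrix $Q$ where

\[ B = Q^T \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{bmatrix} Q \quad \text{so we can set} \quad C = Q^T \begin{bmatrix} \lambda_1^{1/2} & & & \\ & \lambda_2^{1/2} & & \\ & & \ddots & \\ & & & \lambda_n^{1/2} \end{bmatrix} Q. \tag{2.2.6} \]


We say that $C$ is the symmetric positive-definite square root of $B$.


Lemma 2.10 If $A$ has linearly independent columns then $\|A^+\|_2^2 = \|(A^T A)^{-1}\|_2$.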


Proof First note that $A^+ A = (A^T A)^{-1} A^T A = I$. On the other hand, $A A^+ =
A (A^T A)^{-1} A^T$ is an $m \times m$ matrix, which cannot be the identity matrix if $m > n$: the
rank of $A A^+$ cannot be more than the rank of $(A^T A)^{-1}$ which is $n$. However, $P =
A A^+$ is a projection: $P^2 = (A A^+)(A A^+) = A (A^+ A) A^+ = A I A^+ = A A^+ = P$.
Furthermore,


\[ P^T = \left( A (A^T A)^{-1} A^T \right)^T = A^{TT} (A^T A)^{-T} A^T = A (A^T A)^{-1} A^T = P \]


since $A^T A$ and $(A^T A)^{-1}$ are symmetric. Since $P$ is symmetric, it has real
eigenvalues with a basis of orthogonal eigenvectors. Since $P^2 = P$, every eigenvalue of $P$
satisfies $\lambda^2 = \lambda$, that is, $\lambda = 1$ or $\lambda = 0$.

We show that $\|A^+\|_2^2 = \|(A^T A)^{-1}\|_2$:


\[ \|A^+\|_2^2 = \max_{z \ne 0} \frac{\|A^+ z\|_2^2}{\|z\|_2^2} = \max_{z \ne 0} \frac{z^T A (A^T A)^{-1} (A^T A)^{-1} A^T z}{z^T z}. \]


But $P = A (A^T A)^{-1} A^T$ is a symmetric projection matrix. In fact, it is the orthogonal
projection onto $\operatorname{range}(A)$. To see this, suppose $z \in \operatorname{range}(A)$; write $z = Aw$. Then
$Pz = A (A^T A)^{-1} A^T (Aw) = A (A^T A)^{-1} (A^T A) w = Aw = z$. On the other hand, if
$z$ is orthogonal to $\operatorname{range}(A)$, then $z$ is orthogonal to every column of $A$ and so
$A^T z = (z^T A)^T = 0$; thus $Pz = A (A^T A)^{-1} (A^T z) = 0$.

Write $z = u + v$ where $u \in \operatorname{range}(A)$ and $v$ is orthogonal to $\operatorname{range}(A)$. Then
$z^T z = u^T u + v^T v$; also $z^T P z = (u + v)^T u = u^T u$, and $A^T z = A^T u$. Then


\begin{align*}
\|A^+\|_2^2 &= \sup_{u \ne 0,\, v} \frac{(u + v)^T A (A^T A)^{-2} A^T (u + v)}{(u + v)^T (u + v)} \\
 &= \sup_{u \ne 0,\, v} \frac{u^T A (A^T A)^{-2} A^T u}{u^T u + v^T v} \\
 &= \sup_{u \ne 0,\, v} \frac{u^T A (A^T A)^{-2} A^T u}{u^T u} \cdot \frac{u^T u}{u^T u + v^T v} \\
 &= \sup_{u \ne 0} \frac{u^T A (A^T A)^{-2} A^T u}{u^T P u} \\
 &= \sup_{u \ne 0} \frac{u^T A (A^T A)^{-2} A^T u}{u^T A (A^T A)^{-1} A^T u}.
\end{align*}


If we put $w = (A^T A)^{-1/2} A^T u$ then $u^T A (A^T A)^{-2} A^T u = w^T (A^T A)^{-1} w$. On the
other hand, $w^T w = u^T A (A^T A)^{-1} A^T u$. Then


\[ \|A^+\|_2^2 = \sup_{w \ne 0} \frac{w^T (A^T A)^{-1} w}{w^T w}. \]


The maximum value of the ratio $w^T B w / w^T w$ for a symmetric matrix $B$ is the
maximum eigenvalue of $B$. If $B$ is also positive definite, then the maximum value is
$\|B\|_2$. So by (2.1.14) and (2.1.15),


\[ \|A^+\|_2^2 = \lambda_{\max}\left( (A^T A)^{-1} \right) = \|(A^T A)^{-1}\|_2, \]


as we wanted.


Now we can continue the proof of the perturbation theorem for least squares, 
Theorem 2.9. 


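Proof (of Theorem 2.9.) This proof follows [105, Sec. 5.3.7, p. 242]. Let $\bar E = E/\epsilon$ and $\bar d = d/\epsilon$. Since $\|E\|_2 < 1/\|A^+\|_2$, Lemma 2.8 shows that $A + t\bar E$ has linearly independent columns for all $0 \le t \le \epsilon$. Thus the solution $x(t)$ of

(2.2.7) $(A + t\bar E)^T(A + t\bar E)\,x(t) = (A + t\bar E)^T(b + t\bar d)$

is continuously differentiable for $0 \le t \le \epsilon$. Note that $x = x(0)$ and $\hat x = x(\epsilon)$, so

$$\hat x = x + \epsilon\,\frac{dx}{dt}(0) + O(\epsilon^2).$$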


Let $v = dx/dt(0)$. Since $b \neq 0$ and $\sin\theta \neq 1$ we have $x \neq 0$, and

$$\frac{\|\hat x - x\|_2}{\|x\|_2} = \epsilon\,\frac{\|v\|_2}{\|x\|_2} + O(\epsilon^2).$$


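Differentiating (2.2.7) with respect to $t$ at $t = 0$ gives

$$\bar E^TAx + A^T\bar Ex + A^TAv = A^T\bar d + \bar E^Tb; \quad\text{that is,}$$
$$v = (A^TA)^{-1}A^T(\bar d - \bar Ex) + (A^TA)^{-1}\bar E^Tr, \quad\text{where } r = b - Ax.$$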


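Taking norms, we get bounds on $v$:

$$\|v\|_2 \le \|(A^TA)^{-1}A^T\|_2 \left(\|\bar d\|_2 + \|\bar E\|_2\|x\|_2\right) + \|(A^TA)^{-1}\|_2\,\|\bar E\|_2\,\|r\|_2.$$

Since $\epsilon = \max\{\|d\|_2/\|b\|_2,\ \|E\|_2/\|A\|_2\}$, using $\|\bar d\|_2 = \|d\|_2/\epsilon \le \|b\|_2$ and $\|\bar E\|_2 = \|E\|_2/\epsilon \le \|A\|_2$ gives

$$\begin{aligned}
\frac{\|\hat x - x\|_2}{\|x\|_2} &= \epsilon\,\frac{\|v\|_2}{\|x\|_2} + O(\epsilon^2) \\
&\le \epsilon\left\{ \|A\|_2\|A^+\|_2\left(\frac{\|b\|_2}{\|A\|_2\|x\|_2} + 1\right) + \|A\|_2^2\,\|(A^TA)^{-1}\|_2\,\frac{\|r\|_2}{\|A\|_2\|x\|_2} \right\} + O(\epsilon^2).
\end{aligned}$$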


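Since $A^T(Ax - b) = 0$, we see that $x^TA^T(Ax - b) = 0$, so $Ax$ is perpendicular to $Ax - b$. Then

$$\|b\|_2^2 = \|Ax\|_2^2 + \|Ax - b\|_2^2 = \|Ax\|_2^2 + \|r\|_2^2 \quad\text{and so}$$
$$\|A\|_2^2\|x\|_2^2 \ge \|Ax\|_2^2 = \|b\|_2^2 - \|r\|_2^2.$$

But $\|r\|_2 = \sin\theta\,\|b\|_2$, so $\|b\|_2^2 - \|r\|_2^2 = \|b\|_2^2\cos^2\theta \le \|A\|_2^2\|x\|_2^2$. This means $\|b\|_2/(\|A\|_2\|x\|_2) \le 1/\cos\theta$ and $\|r\|_2/(\|A\|_2\|x\|_2) \le \sin\theta/\cos\theta = \tan\theta$. Since $\|(A^TA)^{-1}\|_2 = \|A^+\|_2^2$ by Lemma 2.10, $\|A\|_2^2\,\|(A^TA)^{-1}\|_2 = \|A\|_2^2\,\|A^+\|_2^2 = \kappa_2(A)^2$. Substituting these above gives

$$\frac{\|\hat x - x\|_2}{\|x\|_2} \le \epsilon\left\{ \kappa_2(A)\left(\frac{1}{\cos\theta} + 1\right) + \tan\theta\;\kappa_2(A)^2 \right\} + O(\epsilon^2).$$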


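For the residuals, let

$$r(t) = (b + t\bar d) - (A + t\bar E)\,x(t).$$

Differentiating with respect to $t$ and taking $t = 0$ give

$$s := \frac{dr}{dt}(0) = (I - AA^+)(\bar d - \bar Ex) - A(A^TA)^{-1}\bar E^Tr.$$

So

$$\begin{aligned}
\frac{\|\hat r - r\|_2}{\|b\|_2} &= \epsilon\,\frac{\|s\|_2}{\|b\|_2} + O(\epsilon^2) \\
&\le \epsilon\left\{ \|I - AA^+\|_2\left(\frac{\|A\|_2\|x\|_2}{\|b\|_2} + 1\right) + \|A(A^TA)^{-1}\|_2\,\|A\|_2\,\frac{\|r\|_2}{\|b\|_2} \right\} + O(\epsilon^2).
\end{aligned}$$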
2 


Note that $I - AA^+ = I - P$ is a symmetric projection as $P = AA^+$ is a symmetric projection: $(I - P)^2 = I - 2P + P^2 = I - 2P + P = I - P$. In this case, $P$ is the orthogonal projection onto range($A$); then $I - P$ is the orthogonal projection onto the orthogonal complement of range($A$). Thus $\|I - AA^+\|_2 = 0$ if range($A$) $= \mathbb{R}^m$, and one otherwise. That is, $\|I - AA^+\|_2 = \min(m - n, 1)$. Note that $\|A\|_2\|x\|_2 \le \|A\|_2\|A^+\|_2\|b\|_2 = \kappa_2(A)\|b\|_2$, and $\|r\|_2 = \|b - Ax\|_2 = \|(I - AA^+)b\|_2 \le \min(m - n, 1)\,\|b\|_2$. Then

$$\frac{\|\hat r - r\|_2}{\|b\|_2} \le \epsilon\left\{\min(m - n, 1)\,(1 + 2\kappa_2(A))\right\} + O(\epsilon^2),$$

as we wanted.


An important aspect to note is that if $r = 0$ so that $Ax = b$, then the bound on $\|\hat x - x\|_2 / \|x\|_2$ is linear in the condition number $\kappa_2(A)$, but if $\|r\|_2 / \|b\|_2$ is not small, then the bound grows like $\kappa_2(A)^2$.


2.2.1.3. Cholesky Factorization and Normal Equations 


A common method to solve the normal equations is to use the Cholesky factorization of $A^TA = LL^T$ (see Section 2.1.5.2). Provided the columns of $A$ are linearly independent, $A^TA$ is positive definite and $L$ is invertible. Then $I = L^{-1}A^TAL^{-T} = (AL^{-T})^T(AL^{-T})$ so $Q = AL^{-T}$ satisfies $Q^TQ = I$. However, this $Q$ is not orthogonal in general, because if $A$ is $m \times n$ then $Q$ is $m \times n$ and is not necessarily square. However, if $q_i$ is the $i$th column of $Q$ and since $Q^TQ = I$, then $q_i^Tq_j = 1$ if $i = j$ and zero if $i \neq j$. That is, the columns of $Q$ are orthonormal. Also,

$$A = QL^T.$$

This is a version of the QR factorization, which we consider in the next section.
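The construction $Q = AL^{-T}$ can be tried out on a small matrix. The following is a minimal sketch in plain Python, assuming a dense list-of-lists matrix with linearly independent columns; the function names `cholesky` and `cholesky_qr` are chosen here for illustration and are not from the text.

```python
import math

def cholesky(B):
    """Return lower-triangular L with B = L L^T (B symmetric positive definite)."""
    n = len(B)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = B[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not numerically positive definite")
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            L[i][j] = (B[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

def cholesky_qr(A):
    """Return (Q, L) with A = Q L^T, where A^T A = L L^T and Q = A L^{-T}."""
    m, n = len(A), len(A[0])
    AtA = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
           for i in range(n)]
    L = cholesky(AtA)
    Q = []
    for a in A:
        # Each row q of Q solves q L^T = a, i.e. sum_{k<=j} q_k L[j][k] = a_j.
        q = [0.0] * n
        for j in range(n):
            q[j] = (a[j] - sum(q[k] * L[j][k] for k in range(j))) / L[j][j]
        Q.append(q)
    return Q, L

A = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Q, L = cholesky_qr(A)
# Gram matrix Q^T Q; it should be close to the 2 x 2 identity.
gram = [[sum(Q[k][i] * Q[k][j] for k in range(3)) for j in range(2)] for i in range(2)]
```

For a well-conditioned $A$ such as this one, `gram` agrees with the identity to roundoff. The error analysis that follows explains why this can degrade for ill-conditioned $A$.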

We can try to use the results of the Wilkinson backward error theorem (Theorem 2.5) and its supporting lemmas (Lemmas 2.3 and 2.4) to provide a backward error result for using normal equations and Cholesky factorization. If we use a standard matrix multiplication algorithm, then the computed value of $A^TA$ is

$$\mathrm{fl}(A^TA) = (A + E)^TA, \qquad |E| \le \gamma_m |A|,$$

where $\gamma_m = mu + O(mu^2)$ and $u$ is unit roundoff, by Lemma 2.4. Note that $|A|$ is the matrix of absolute values $|a_{ij}|$. There is no guarantee that $\mathrm{fl}(A^TA)$ is symmetric or positive definite even if the columns of $A$ are linearly independent. For example, consider

$$A^TA = \begin{bmatrix} 1 & 1+\eta \\ 0 & \eta \end{bmatrix}^T \begin{bmatrix} 1 & 1+\eta \\ 0 & \eta \end{bmatrix} = \begin{bmatrix} 1 & 1+\eta \\ 1+\eta & 1 + 2\eta + 2\eta^2 \end{bmatrix}.$$

If $\eta$ is a power of two and $u < \eta < \sqrt{u}$, then $\mathrm{fl}(1 + 2\eta + 2\eta^2) = 1 + 2\eta$ in IEEE arithmetic. In this case, $\det \mathrm{fl}(A^TA) = (1 + 2\eta) - (1 + \eta)^2 = -\eta^2$ so $\mathrm{fl}(A^TA)$ must have a negative eigenvalue.
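This loss of positive definiteness can be observed directly in IEEE double precision. The sketch below assumes $\eta = 2^{-30}$, which is one admissible choice since $u = 2^{-53}$, and uses exact rational arithmetic only to evaluate the determinant of the already-rounded entries.

```python
from fractions import Fraction

eta = 2.0 ** -30           # power of two with u < eta < sqrt(u), u = 2**-53

# A = [[1, 1 + eta], [0, eta]] has linearly independent columns.
a12 = 1.0 + eta            # exactly representable in double precision

# Entries of fl(A^T A), formed with ordinary double-precision arithmetic.
g11 = 1.0
g12 = a12
g22 = a12 * a12 + eta * eta    # exact value is 1 + 2*eta + 2*eta**2

# Determinant of the *computed* matrix, evaluated exactly with rationals.
det_computed = Fraction(g11) * Fraction(g22) - Fraction(g12) ** 2

# Determinant of the exact A^T A, for comparison.
e = Fraction(eta)
det_exact = (1 + 2 * e + 2 * e * e) - (1 + e) ** 2
```

Here `det_exact` equals $\eta^2 > 0$, while `det_computed` equals $-\eta^2 < 0$: rounding the $(2,2)$ entry alone is enough to make the stored matrix indefinite.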

The roundoff errors incurred by Cholesky factorization can also turn a matrix from being positive definite to positive semi-definite or neither positive definite nor positive semi-definite. For example, $B = \begin{bmatrix} 3 + 2u & 3 \\ 3 & 3 \end{bmatrix}$ is positive definite, but its Cholesky factorization fails in IEEE double precision as the computed value of $a_{22} - (a_{21}/\sqrt{a_{11}})^2$ is exactly zero.

However, if the Cholesky factorization succeeds, Theorem 2.5 can be adapted to show that the computed solution $\hat x$ satisfies

$$(A^TA + E)\,\hat x = (A + F)^Tb$$

with $|E| \le O(mu)\,|A|^T|A|$ and $|F| \le O(mu)\,|A|^T|b|$. The perturbation theorem for linear systems (Theorem 2.1) then gives the relative error in the solution

$$\frac{\|\hat x - x\|_2}{\|x\|_2} = O(mu)\,\kappa_2(A^TA) = O(mu)\,\kappa_2(A)^2.$$


The fact that we have the square of the condition number can be cause for concern for ill-conditioned problems. The QR factorization of the following section gives a way of improving this for situations where the residual is small compared to the right-hand side: $\|r\|_2 \ll \|b\|_2$.


2.2.2. QR Factorization 


The QR factorization of an $m \times n$ matrix $A$ is

(2.2.8) $A = QR$, $Q$ orthogonal and $R$ upper triangular.

Note that $Q$ is $m \times m$ while $R$ is $m \times n$. If $m > n$ then we can split both $Q$ and $R$ consistently:

(2.2.9) $A = QR = [\,Q_1 \mid Q_2\,] \begin{bmatrix} R_1 \\ 0 \end{bmatrix} = Q_1R_1,$

where $Q_1$ is $m \times n$ and has orthonormal columns and $R_1$ is $n \times n$ and upper triangular. This is called the skinny, thin, or reduced QR factorization of $A$.

Recall that a matrix $Q$ is orthogonal if $Q^T = Q^{-1}$, or equivalently, $Q$ is square and $Q^TQ = I$. Note that orthogonal matrices preserve the 2-norms of vectors:

(2.2.10) $\|Qz\|_2^2 = (Qz)^T(Qz) = z^TQ^TQz = z^TIz = z^Tz = \|z\|_2^2.$

Lemma 2.11 If $A$ has linearly independent columns, then $R_1$ is invertible.

Proof If $A$ has linearly independent columns, then $Ax = 0$ implies $x = 0$. Now suppose $R_1x = 0$. Then $Ax = Q_1R_1x = 0$, and so $x = 0$. Thus the only way that $R_1x = 0$ is if $x = 0$. Since $R_1$ is square, this means that $R_1$ is invertible. That is, if $A$ has linearly independent columns, then $R_1$ is invertible.

If $A$ is complex, then the QR factorization of $A$ is $A = QR$ where $Q$ is unitary ($Q^{-1} = \overline{Q}^T$) and $R$ is upper triangular.


2.2.2.1 Using QR Factorizations

QR factorizations are most frequently used to solve least squares problems. Consider the problem

$$\min_x \|Ax - b\|_2$$

with $A$ an $m \times n$ matrix with $m > n$. Then

$$\begin{aligned}
\|Ax - b\|_2 &= \|QRx - b\|_2 = \|Q(Rx - Q^Tb)\|_2 \\
&= \|Rx - Q^Tb\|_2 \quad\text{by (2.2.10)} \\
&= \left\| \begin{bmatrix} R_1 \\ 0 \end{bmatrix} x - \begin{bmatrix} Q_1^T \\ Q_2^T \end{bmatrix} b \right\|_2 \\
&= \sqrt{\|R_1x - Q_1^Tb\|_2^2 + \|Q_2^Tb\|_2^2}.
\end{aligned}$$

Assuming that $A$ has linearly independent columns, so that $R_1$ is invertible by Lemma 2.11, we minimize $\|Ax - b\|_2$ by setting $R_1x - Q_1^Tb = 0$. That is, we set $x = R_1^{-1}Q_1^Tb$, and

$$\min_x \|Ax - b\|_2 = \|Q_2^Tb\|_2 = \sqrt{\|b\|_2^2 - \|Q_1^Tb\|_2^2}.$$

Note that solving the least squares problem only requires the reduced QR factorization.
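There are some useful properties of these matrices: since $Q^T = Q^{-1}$,

$$I = Q^TQ = \begin{bmatrix} Q_1^T \\ Q_2^T \end{bmatrix} [\,Q_1 \ \ Q_2\,] = \begin{bmatrix} Q_1^TQ_1 & Q_1^TQ_2 \\ Q_2^TQ_1 & Q_2^TQ_2 \end{bmatrix} = \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix}$$

and so $Q_1^TQ_1 = I$, $Q_2^TQ_2 = I$ while $Q_1^TQ_2 = 0$. Also,

$$I = QQ^T = [\,Q_1 \ \ Q_2\,] \begin{bmatrix} Q_1^T \\ Q_2^T \end{bmatrix} = Q_1Q_1^T + Q_2Q_2^T.$$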

Lemma 2.12 Suppose $A$ has linearly independent columns. Then the matrix $Q_1Q_1^T$ is the orthogonal projection onto range($A$), while $Q_2Q_2^T$ is the orthogonal projection onto range($A$)$^\perp$.

Proof Suppose $z \in$ range($A$), so that $z = Au$ for some unique $u$. Then $\|Ax - z\|_2$ is minimized by $x = u = R_1^{-1}Q_1^Tz$. That is, $R_1u = Q_1^Tz$. Pre-multiplying by $Q_1$ gives $Q_1Q_1^Tz = Q_1R_1u = Au = z$.

Now suppose that $z$ is orthogonal to range($A$). Then $A^Tz = 0$. Substituting using the reduced QR factorization gives $R_1^TQ_1^Tz = 0$. Since $A$ has linearly independent columns, $R_1$ is invertible, and so $Q_1^Tz = 0$. Therefore, $Q_1Q_1^Tz = Q_1 0 = 0$.

Regarding $Q_2Q_2^T$, we note that $Q_2Q_2^T = I - Q_1Q_1^T$ so $Q_2Q_2^T$ is the complementary projection to $Q_1Q_1^T$. If $z \in$ range($A$) then $Q_2Q_2^Tz = (I - Q_1Q_1^T)z = z - Q_1Q_1^Tz = z - z = 0$, while if $z$ is orthogonal to range($A$), $Q_2Q_2^Tz = (I - Q_1Q_1^T)z = z - Q_1Q_1^Tz = z - 0 = z$.


2.2.2.2. Gram-Schmidt Orthogonalization 


A form of the QR factorization was developed with the work of Erhard Schmidt [230] (1907) citing the method of Jørgen Pedersen Gram [108] (1883), although Laplace had already used this idea by 1820. (See Laplace's Supplement Sur l'application du calcul des probabilités à la philosophie naturelle to [155]. Translation and modern interpretation can be found in [153].)


Given a sequence of vectors $a_1, a_2, a_3, \dots, a_k$ that are linearly independent, we wish to produce a sequence $q_1, q_2, q_3, \dots, q_k$ of vectors that are either orthogonal or orthonormal, and span$\{a_1, \dots, a_j\}$ = span$\{q_1, \dots, q_j\}$ for $j = 1, 2, \dots, k$.

We will focus on making the $q_j$'s orthonormal.
The Gram–Schmidt process is shown in Algorithm 14.

Algorithm 14 Gram–Schmidt process

1 function gramschmidt($a_1, a_2, \dots, a_k$)
2   for $j = 1, 2, \dots, k$
3     for $i = 1, 2, \dots, j-1$: $r_{ij} \leftarrow a_j^Tq_i$; end
4     $b_j \leftarrow a_j - \sum_{i=1}^{j-1} r_{ij}q_i$
5     $r_{jj} \leftarrow \|b_j\|_2$
6     $q_j \leftarrow b_j / r_{jj}$
7   end for
8   return $(q_1, q_2, \dots, q_k)$
9 end function


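Theorem 2.13 The Gram–Schmidt process produces a sequence $q_1, q_2, \dots, q_k$ where

$$q_i^Tq_j = \begin{cases} 1, & \text{if } i = j, \\ 0, & \text{if } i \neq j, \end{cases} \quad\text{for } i, j = 1, 2, \dots, k,$$

$\|q_i\|_2 = 1$ for $i = 1, 2, \dots, k$, and span$\{a_1, \dots, a_j\}$ = span$\{q_1, \dots, q_j\}$ for $j = 1, 2, \dots, k$.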


Proof We proceed by induction on $k$. The base case for $k = 1$ just requires that $\|q_1\|_2 = 1$, which is guaranteed by lines 5 and 6.

Suppose that the statement is true for $k = p$. We show that the statement is true for $k = p + 1$.

To see that Algorithm 14 does indeed produce an orthonormal basis, we note that

$$b_{p+1} = a_{p+1} - \sum_{i=1}^{p} (a_{p+1}^Tq_i)\,q_i \quad\text{from lines 3 and 4.}$$

Then for $\ell < p + 1$,

$$\begin{aligned}
q_\ell^Tb_{p+1} &= q_\ell^Ta_{p+1} - \sum_{i=1}^{p} (a_{p+1}^Tq_i)\,q_\ell^Tq_i \\
&= q_\ell^Ta_{p+1} - \sum_{i=1}^{p} (a_{p+1}^Tq_i) \begin{cases} 1, & \text{if } \ell = i, \\ 0, & \text{if } \ell \neq i \end{cases} \\
&= q_\ell^Ta_{p+1} - a_{p+1}^Tq_\ell = 0.
\end{aligned}$$

That is, $b_{p+1}$ is orthogonal to span$\{a_1, a_2, \dots, a_p\}$. Since $q_{p+1} = b_{p+1}/\|b_{p+1}\|_2$ we see that $q_{p+1}$ has unit length and is orthogonal to span$\{a_1, a_2, \dots, a_p\}$ = span$\{q_1, \dots, q_p\}$; thus $\{q_1, q_2, \dots, q_p, q_{p+1}\}$ is an orthonormal set. Furthermore,

$$q_{p+1} = \Big(a_{p+1} - \sum_{i=1}^{p} r_{i,p+1}\,q_i\Big) \Big/ \|b_{p+1}\|_2, \quad\text{and}$$

(2.2.11) $a_{p+1} = \|b_{p+1}\|_2\,q_{p+1} + \sum_{i=1}^{p} r_{i,p+1}\,q_i$

so span$\{a_1, a_2, \dots, a_{p+1}\}$ = span$\{q_1, q_2, \dots, q_{p+1}\}$.

Thus, by the principle of induction, the conclusions we want hold for $k = 1, 2, 3, \dots$.


Suppose that $a_j$ is column $j$ of an $m \times n$ matrix $A$. Then Equation (2.2.11) means that

$$a_j = [q_1, q_2, q_3, \dots, q_j] \begin{bmatrix} a_j^Tq_1 \\ a_j^Tq_2 \\ a_j^Tq_3 \\ \vdots \\ \|b_j\|_2 \end{bmatrix}$$

(2.2.12) $\phantom{a_j} = [q_1, q_2, \dots, q_j, q_{j+1}, \dots, q_n] \begin{bmatrix} a_j^Tq_1 \\ a_j^Tq_2 \\ \vdots \\ \|b_j\|_2 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.$

The column in (2.2.12) is the $j$th column of an upper triangular matrix we call $R_1$. Putting the columns together

$$A = [a_1, a_2, \dots, a_n] = [q_1, q_2, \dots, q_n]\,R_1 = Q_1R_1.$$

This is not the full QR factorization, but rather the reduced QR factorization: $Q_1$ is $m \times n$ while $R_1$ is $n \times n$. As noted above, $Q_1$ has orthonormal columns rather than being an orthogonal matrix.


2.2.2.3 Modified Versus Classical Gram—-Schmidt 


Algorithm 14 is not the only way to orthonormalize a family of vectors. In fact, since

$$q_\ell^T\Big(a_j - \sum_{i=1}^{\ell-1} (a_j^Tq_i)\,q_i\Big) = q_\ell^Ta_j,$$

we can change Algorithm 14 to Algorithm 15.

Algorithm 15 Modified Gram–Schmidt process

function mgs($a_1, a_2, \dots, a_k$)
  for $j = 1, 2, \dots, k$
    $b_j \leftarrow a_j$
    for $i = 1, 2, \dots, j-1$
      $r_{ij} \leftarrow b_j^Tq_i$
      $b_j \leftarrow b_j - r_{ij}q_i$
    end for
    $r_{jj} \leftarrow \|b_j\|_2$
    $q_j \leftarrow b_j / r_{jj}$
  end for
  return $(q_1, q_2, \dots, q_k)$
end function

It is easy to confuse the modified with the original (or classical) Gram—Schmidt 
algorithms. In fact, Laplace used the modified process. While the classical Gram— 
Schmidt (CGS) and modified Gram—Schmidt (MGS) are identical in exact arithmetic, 
they behave differently in the presence of roundoff error. 
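The difference between the two variants shows up on a small ill-conditioned example. The sketch below is plain Python, and the test matrix (three nearly parallel columns with a perturbation of size $10^{-8}$) is an assumption chosen here for illustration, not an example from the text. The only difference between `cgs` and `mgs` is which vector the coefficient $r_{ij}$ is computed from.

```python
import math

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def cgs(vectors):
    """Classical Gram-Schmidt: every r_ij is computed from the original a_j."""
    qs = []
    for a in vectors:
        r = [dot(a, q) for q in qs]
        b = list(a)
        for rij, q in zip(r, qs):
            b = [bi - rij * qi for bi, qi in zip(b, q)]
        norm = math.sqrt(dot(b, b))
        qs.append([bi / norm for bi in b])
    return qs

def mgs(vectors):
    """Modified Gram-Schmidt: r_ij is computed from the partially reduced b_j."""
    qs = []
    for a in vectors:
        b = list(a)
        for q in qs:
            rij = dot(b, q)
            b = [bi - rij * qi for bi, qi in zip(b, q)]
        norm = math.sqrt(dot(b, b))
        qs.append([bi / norm for bi in b])
    return qs

def orth_error(qs):
    """Largest entry of |Q^T Q - I|."""
    return max(abs(dot(qi, qj) - (1.0 if i == j else 0.0))
               for i, qi in enumerate(qs) for j, qj in enumerate(qs))

eps = 1e-8
columns = [[1.0, eps, 0.0, 0.0],
           [1.0, 0.0, eps, 0.0],
           [1.0, 0.0, 0.0, eps]]
err_cgs = orth_error(cgs(columns))
err_mgs = orth_error(mgs(columns))
```

On this input the classical variant loses orthogonality almost completely (two of its vectors have inner product close to 1/2), while the modified variant keeps the error near the size of the perturbation.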

The different behavior between CGS and MGS is discussed in [16], where they show that the $R_{CGS}$ matrix computed by CGS is the exact Cholesky factor of a perturbed normal equations matrix $A^TA + E$ where $\|E\|_2 \le O(mn^2)\,u\,\|A\|_2^2$. Further, the resulting loss of orthogonality is bounded by $\|I - Q_{CGS}^TQ_{CGS}\|_2 = O(mn^2)\,u\,\kappa_2(A)^2$. MGS, on the other hand, as noted in [105, Sec 5.2.9, p. 232] has $\|I - Q_{MGS}^TQ_{MGS}\|_2 = O(mn^2)\,u\,\kappa_2(A)$ (see [23]).

According to [16], the loss of orthogonality can be remedied by repeating the CGS algorithm one more time. That is, if CGS gives $A \approx Q_{CGS}^{(1)}R_{CGS}^{(1)}$, then by computing the QR factorization of $Q_{CGS}^{(1)}$ by CGS once again ($Q_{CGS}^{(1)} \approx Q_{CGS}^{(2)}R_{CGS}^{(2)}$) we get a matrix $Q_{CGS}^{(2)}$ that is much closer to having orthogonal columns:

$$\left\|I - Q_{CGS}^{(2)T}Q_{CGS}^{(2)}\right\|_2 = O(mn)\,u.$$

This result is of more than theoretical interest. While other algorithms for QR factorization are available, MGS and CGS are more suitable for “online” algorithms where each vector $a_j$ is added when it is available. When a new vector $a_j$ is presented, MGS is inherently a sequential algorithm while CGS can schedule the dot products $r_{ij} = a_j^Tq_i$ for $i < j$ in parallel, and then the linear combination $b_j \leftarrow a_j - \sum_{i=1}^{j-1} r_{ij}q_i$ can also be computed in parallel using $O(m \log j)$ time. A “CGS2” algorithm can be used to obtain a good orthonormal basis efficiently in parallel. However, care should be taken with the $R$ matrix computed by CGS if $A$ is ill-conditioned.


2.2.2.4 Householder Reflectors 


In 1958, Alston Householder discovered a way of carrying out QR factorizations with high accuracy [182]. At the time, it was presented as an improvement of Givens' rotation approach, which is discussed in the next section. We present this through a recursive algorithm for the QR factorization based on Householder reflectors. A Householder reflector is a matrix

(2.2.13) $P = I - 2\,\dfrac{vv^T}{v^Tv}.$

The matrix $P$ is both symmetric and orthogonal. Symmetry is clear from (2.2.13), but to show orthogonality we need some calculations:

$$\begin{aligned}
P^TP = P^2 &= \left(I - 2\frac{vv^T}{v^Tv}\right)\left(I - 2\frac{vv^T}{v^Tv}\right) \\
&= I - 2\frac{vv^T}{v^Tv} - 2\frac{vv^T}{v^Tv} + 4\frac{v(v^Tv)v^T}{(v^Tv)^2} \\
&= I - 4\frac{vv^T}{v^Tv} + 4\frac{vv^T}{v^Tv} = I.
\end{aligned}$$
The main step we need is, given a vector $a \in \mathbb{R}^n$, to find a $v$ so that the Householder reflector $P$ satisfies $Pa = \alpha e_1$, where $e_1 = [1, 0, 0, \dots, 0]^T$. Since $P$ is orthogonal, $|\alpha| = \|a\|_2$. In real arithmetic, we have $\alpha = \pm\|a\|_2$. If

$$\alpha e_1 = Pa = \left(I - 2\frac{vv^T}{v^Tv}\right)a = a - 2\,\frac{v^Ta}{v^Tv}\,v$$

we let $\gamma = 2\,(v^Ta)/(v^Tv)$ and so

$$\alpha e_1 = a - \gamma v.$$

Assuming $\gamma \neq 0$ we can re-arrange this to

$$\gamma v = a - \alpha e_1.$$

Note, however, that scaling $v \mapsto \rho v$ does not change $P$. So we can arbitrarily set $\gamma = 1$. In real arithmetic, we then have

(2.2.14) $v = a \pm \|a\|_2\,e_1.$

Algorithm 16 Householder QR factorization (recursive, overwrites A)

function hhQR($A$)
  $a \leftarrow$ 1st column of $A$
  compute $v$ via (2.2.15)
  $\beta \leftarrow 2/(v^Tv)$
  $A \leftarrow PA$
  let $A = \begin{bmatrix} \alpha & b^T \\ 0 & \hat A \end{bmatrix}$
  if $A$ has one column
    return $(P, A)$
  end if
  $(\hat Q, \hat R) \leftarrow$ hhQR($\hat A$)
  $Q \leftarrow P \begin{bmatrix} 1 & \\ & \hat Q \end{bmatrix}$; $R \leftarrow \begin{bmatrix} \alpha & b^T \\ 0 & \hat R \end{bmatrix}$
  return $(Q, R)$
end function


Once we have this formula, we just need to choose the sign. Since subtracting nearly equal quantities reduces relative accuracy, we choose the sign to be the sign of $a_1$ so that this first component is not subtracted. That is,

(2.2.15) $v = a + \mathrm{sign}(a_1)\,\|a\|_2\,e_1.$

For the efficient use of $P$, we can pre-compute $\beta := 2/(v^Tv)$, so $P = I - \beta vv^T$. Computing $v$ and $\beta$ given $a$ takes $2n + 5$ flops. Computing $Px = x - \beta(v^Tx)v$ given $x$, $v$ and $\beta$ takes $4n + 1$ flops.
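Formula (2.2.15) and the update $Px = x - \beta(v^Tx)v$ can be sketched in a few lines of plain Python. The function names below are illustrative, and the sketch assumes $a \neq 0$. With this sign choice the result is $Pa = -\mathrm{sign}(a_1)\|a\|_2 e_1$.

```python
import math

def householder(a):
    """Return (v, beta) with v = a + sign(a_1) ||a||_2 e_1 and beta = 2 / (v^T v).
    Assumes a is not the zero vector."""
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    v = list(a)
    v[0] += math.copysign(norm_a, a[0])   # same sign as a_1, so no cancellation
    beta = 2.0 / sum(vi * vi for vi in v)
    return v, beta

def apply_reflector(v, beta, x):
    """Compute P x = x - beta (v^T x) v without forming the matrix P."""
    s = beta * sum(vi * xi for vi, xi in zip(v, x))
    return [xi - s * vi for xi, vi in zip(x, v)]

a = [3.0, 4.0]
v, beta = householder(a)
Pa = apply_reflector(v, beta, a)        # approximately [-5, 0]
back = apply_reflector(v, beta, Pa)     # P is its own inverse, so this recovers a
```

Applying the reflector through `v` and `beta` is the reason $Q$ rarely needs to be formed explicitly.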

Usually we avoid storing or computing $Q$ explicitly. Instead we store the $v$ vectors and possibly the $\beta$ values for each Householder reflector. Efficient storage schemes include the compact WY storage method to represent the $Q$ matrix [105, Sec. 5.1.7, pp. 213-215]: $Q = I + WY^T$ with both $W$ and $Y$ being $n \times r$ matrices. Updating $Q_+ = (I - \gamma vv^T)Q$ can be done by setting $Q_+ = I + W_+Y_+^T$ with $W_+ = [W \mid v]$ and $Y_+ = [Y \mid z]$ with $z = -\gamma(I + YW^T)v$. This update takes $O(rn)$ operations. Complementary scaling of $v$ and $z$ can be done to prevent the columns of $W$ and $Y$ from becoming badly scaled.

Using Householder reflectors gives a QR factorization where the computed $\hat Q$ and computed $\hat R$ satisfy

$$\|I - \hat Q^T\hat Q\|_2 = O(mn\,u), \quad\text{and}$$
$$\|A - \hat Q\hat R\|_2 = O(mn\,u)\,\|A\|_2.$$


2.2.2.5 Givens' Rotations


An alternative approach to computing QR factorizations is to use Givens' rotations. Two-dimensional rotations have the form

$$\begin{bmatrix} c & -s \\ s & c \end{bmatrix}, \qquad c^2 + s^2 = 1.$$

We can use Givens' rotations to shape the non-zeros in a vector:

$$\begin{bmatrix} c & -s \\ s & c \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \alpha \\ 0 \end{bmatrix},$$

where $\alpha = \pm\sqrt{x^2 + y^2}$ since rotations preserve the 2-norm. So we need $sx + cy = 0$. This can be satisfied with

$$s = \frac{-y}{\sqrt{x^2 + y^2}}, \qquad c = \frac{x}{\sqrt{x^2 + y^2}}.$$

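The computation of $\sqrt{x^2 + y^2}$ can be done using the hypot function to avoid overflow or underflow. By systematically zeroing out entries in $A$ we can create a QR factorization. For this we need to choose indexes $(i, j)$ where

$$G(i, j; c, s) = \begin{bmatrix} I & & & & \\ & c & & -s & \\ & & I & & \\ & s & & c & \\ & & & & I \end{bmatrix} \begin{matrix} \\ \leftarrow \text{row } i \\ \\ \leftarrow \text{row } j \\ \\ \end{matrix}$$

and is otherwise equal to the identity matrix. Then

$$G(i, j; c, s)_{pq} = \begin{cases}
1, & \text{if } p = q \neq i, j, \\
0, & \text{if } p \neq q \text{ and } \{p, q\} \neq \{i, j\}, \\
c, & \text{if } p = q = i \text{ or } p = q = j, \\
s, & \text{if } p = j \text{ and } q = i, \\
-s, & \text{if } p = i \text{ and } q = j.
\end{cases}$$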

The QR factorization using Givens' rotations is shown in Algorithm 17.

Rather than save the $(c, s)$ pair for each Givens' rotation, we can save $\theta$ where $c = \cos\theta$ and $s = \sin\theta$, or an equivalent encoding.

From a computational point of view, Givens' rotations are fairly expensive as each requires the computation of a square root, which takes about 10 times as long as a multiplication, addition, or subtraction. Otherwise, the method has essentially the same performance and error behavior as Householder QR.

Algorithm 17 Givens' rotation QR factorization (overwrites A)

function givensQR($A$)
  $Q \leftarrow I$
  for $i = 1, 2, \dots, n$
    for $j = i+1, \dots, m$
      $c \leftarrow a_{ii}/\sqrt{a_{ii}^2 + a_{ji}^2}$; $s \leftarrow -a_{ji}/\sqrt{a_{ii}^2 + a_{ji}^2}$
      $A \leftarrow G(i, j; c, s)\,A$; $Q \leftarrow Q\,G(i, j; c, s)^T$
    end for
  end for
  return $(Q, A)$
end function
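A plain-Python sketch of the rotation-based factorization is given below. It assumes a dense list-of-lists matrix and 0-based indices, uses `math.hypot` for the square root as suggested above, and skips the rotation when both entries are already zero; that guard and the function names are additions made here for illustration.

```python
import math

def givens(x, y):
    """Return (c, s) so that [[c, -s], [s, c]] maps (x, y) to (hypot(x, y), 0)."""
    r = math.hypot(x, y)
    if r == 0.0:
        return 1.0, 0.0              # nothing to rotate; use the identity
    return x / r, -y / r

def givens_qr(A):
    """Return (Q, R) with A = Q R, Q orthogonal (m x m), R upper triangular (m x n)."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[1.0 if p == q else 0.0 for q in range(m)] for p in range(m)]
    for i in range(min(m, n)):
        for j in range(i + 1, m):
            c, s = givens(R[i][i], R[j][i])
            for k in range(n):       # R <- G R changes rows i and j only
                ri, rj = R[i][k], R[j][k]
                R[i][k], R[j][k] = c * ri - s * rj, s * ri + c * rj
            for k in range(m):       # Q <- Q G^T changes columns i and j only
                qi, qj = Q[k][i], Q[k][j]
                Q[k][i], Q[k][j] = c * qi - s * qj, s * qi + c * qj
    return Q, R

A = [[3.0, 1.0], [4.0, 2.0], [0.0, 5.0]]
Q, R = givens_qr(A)
```

Each rotation touches only two rows of `R` and two columns of `Q`, which is why Givens' rotations suit matrices where few entries need to be zeroed.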


Exercises.

(1) Generate $N = 20$ values $x_i$ randomly and uniformly in $[0, 1]$, and set $y_i = f(x_i) + \alpha\epsilon_i$ with $\epsilon_i$ generated randomly with a standard normal distribution, $f(x) = 5x^2 - x + 1$, and $\alpha = 0.2$. Plot the data as a scatterplot (no lines joining the points). Solve the least squares problem to find the coefficients of the best fit quadratic (minimize $\sum_{i=1}^{N} (y_i - (ax_i^2 + bx_i + c))^2$) using either normal equations or the QR factorization. Plot $g(x) = ax^2 + bx + c$ for the best fit as well as $f(x)$. How close are $f(x)$ and $g(x)$ over $[0, 1]$?

(2) Generate $N = 20$ vectors $x_i$ randomly and uniformly in $[0, 1]^5 \subset \mathbb{R}^5$, and set $y_i = c_0^Tx_i + \alpha\epsilon_i$ where $c_0 = [1, 0, -2, 5, -1]^T$ and $\epsilon_i$ generated by a standard normal distribution, and $\alpha = 0.2$. If $X$ is the data matrix $[x_1, x_2, \dots, x_N]^T$, solve the least squares problem $\min_c \|Xc - y\|_2$ via either normal equations or the QR factorization. Look at $\|c - c_0\|_2$ to see if the least squares estimate is close to the vector generating the data.

(3) For the data matrix $X$ in Exercise 2, replace $X$ with $X + F$ where $F$ is a perturbation matrix and repeat the computations. To generate $F$, set each entry $f_{ij}$ to be $10^{-4}$ times a pseudo-random sample from a standard normal distribution. Compare the change in the solution to the estimate given in Theorem 2.9.

(4) Show that if $A$ is real, square, and invertible, then the QR factorization is unique apart from a diagonal scaling by factors of $\pm 1$. That is, if $A = Q_1R_1 = Q_2R_2$ with $Q_1$, $Q_2$ orthogonal and $R_1$, $R_2$ upper triangular, then there is a diagonal matrix $D$ with diagonal entries $\pm 1$ where $Q_2 = Q_1D$ and $DR_2 = R_1$. In particular, show that if the diagonal entries of $R_1$ and $R_2$ are all positive, then $Q_1 = Q_2$ and $R_1 = R_2$.

(5) Show partial uniqueness for “thin” QR factorizations of $A$, an $m \times n$ real matrix with $m > n$, provided $A$ has linearly independent columns. Suppose $A = Q_1R_1 = Q_2R_2$ where $Q_1$ and $Q_2$ are $m \times n$ with orthonormal columns ($Q_1^TQ_1 = Q_2^TQ_2 = I$) and $R_1$, $R_2$ are $n \times n$ upper triangular matrices. Show that there is a diagonal matrix $D$ with diagonal entries $\pm 1$ where $Q_2 = Q_1D$ and $DR_2 = R_1$.

(6) Repeat Exercise 5 for a complex $m \times n$ matrix $A$ with some modifications: $A = Q_1R_1 = Q_2R_2$ where $Q_1$, $Q_2$ are $m \times n$ and $\overline{Q_1}^TQ_1 = \overline{Q_2}^TQ_2 = I$, while $R_1$, $R_2$ are upper triangular. Show that there is a diagonal matrix $D$ with diagonal entries of the form $e^{i\theta}$, $\theta \in \mathbb{R}$, where $Q_2 = Q_1D$ and $DR_2 = R_1$.

(7) Show that if $A$ is a real $m \times n$ matrix where $A^TA = LL^T$ (Cholesky factorization) and $A = QR$ (QR factorization) then there is a diagonal matrix $D$ with diagonal entries $\pm 1$ where $R = DL^T$.

(8) QR factorizations can give orthonormal bases for the range and null space of rectangular matrices. Suppose $A$ is a real $m \times n$ matrix with linearly independent columns. Let $A = [Q_1, Q_2] \begin{bmatrix} R_1 \\ 0 \end{bmatrix}$ be the QR factorization of $A$. Show that range($A$) = range($Q_1$) and null($A^T$) = range($Q_2$). [Hint: For the second part, you can use Theorem 8.6.]


2.3 Sparse Matrices 


Sparse matrices are matrices where most entries are zero. This can be exploited using 
sparse matrix data structures that are substantially different from the simple arrays 
used for dense or full matrices where every entry is explicitly represented. Sparse 
matrices often arise in practice related to differential equations and networks. 

In this section, we will focus on direct methods rather than iterative methods. 
Sparse matrices can also be used in iterative methods where the matrix is only used 
to compute matrix—vector products. If a sparse matrix is only used for computing 
matrix—vector products, the exact location of the non-zeros is not particularly impor- 
tant. What is important is having an implementation that can efficiently navigate the 
data structure. 

Direct methods, such as LU, Cholesky, and QR factorizations, often result in the 
creation of new non-zero entries that had originally been zero. This is called fill-in 
and is important for understanding the performance of a sparse factorization method. 


2.3.1 Tridiagonal Matrices 


The simplest example of a family of sparse matrices that have practical impact are tridiagonal matrices. This is a matrix $A$ where $a_{ij} \neq 0$ implies $|i - j| \le 1$. That is, all non-zeros of a tridiagonal matrix occur on the main diagonal ($i = j$) or the first sub-diagonal ($i = j + 1$) or first superdiagonal ($i = j - 1$). This is part of a more general family of banded matrices of bandwidth $b$ where $a_{ij} \neq 0$ implies $|i - j| \le b$.

Algorithm 18 Tridiagonal LU factorization with no pivoting

function tridiagLU($a, b, c$)
  $u_1 \leftarrow a_1$; $v_1 \leftarrow b_1$; $\ell_1 \leftarrow c_1/u_1$
  for $k = 2, \dots, n-1$
    $u_k \leftarrow a_k - \ell_{k-1}v_{k-1}$
    $v_k \leftarrow b_k$; $\ell_k \leftarrow c_k/u_k$
  end for
  $u_n \leftarrow a_n - \ell_{n-1}v_{n-1}$
  return $\ell, u, v$
end function


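A general example of a tridiagonal matrix is shown below:

(2.3.1) $A = \begin{bmatrix} a_1 & b_1 & & & \\ c_1 & a_2 & b_2 & & \\ & c_2 & a_3 & \ddots & \\ & & \ddots & \ddots & b_{n-1} \\ & & & c_{n-1} & a_n \end{bmatrix}.$

Computing the LU factorization without partial pivoting can be done efficiently:

$$\begin{bmatrix} a_1 & b_1 & & & \\ c_1 & a_2 & b_2 & & \\ & c_2 & a_3 & \ddots & \\ & & \ddots & \ddots & b_{n-1} \\ & & & c_{n-1} & a_n \end{bmatrix} = \begin{bmatrix} 1 & & & & \\ \ell_1 & 1 & & & \\ & \ell_2 & 1 & & \\ & & \ddots & \ddots & \\ & & & \ell_{n-1} & 1 \end{bmatrix} \begin{bmatrix} u_1 & v_1 & & & \\ & u_2 & v_2 & & \\ & & u_3 & \ddots & \\ & & & \ddots & v_{n-1} \\ & & & & u_n \end{bmatrix}$$

leads to the equations

$$\begin{aligned}
a_1 &= u_1, & b_1 &= v_1, & c_1 &= \ell_1u_1, \\
a_2 &= \ell_1v_1 + u_2, & b_2 &= v_2, & c_2 &= \ell_2u_2, \\
a_3 &= \ell_2v_2 + u_3, & b_3 &= v_3, & c_3 &= \ell_3u_3, \\
&\ \ \vdots \\
a_{n-1} &= \ell_{n-2}v_{n-2} + u_{n-1}, & b_{n-1} &= v_{n-1}, & c_{n-1} &= \ell_{n-1}u_{n-1}, \\
a_n &= \ell_{n-1}v_{n-1} + u_n.
\end{aligned}$$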


Solving these equations gives Algorithm 18. This algorithm can also be written to overwrite $a$ with $u$, $b$ with $v$, and $c$ with $\ell$.

Algorithm 19 Tridiagonal solver using LU factorization

function tridiagsolve($\ell, u, v, y$)
  $x \leftarrow y$
  for $k = 2, \dots, n$
    $x_k \leftarrow x_k - \ell_{k-1}x_{k-1}$
  end for
  $x_n \leftarrow x_n/u_n$
  for $k = n-1, \dots, 2, 1$
    $x_k \leftarrow (x_k - v_kx_{k+1})/u_k$
  end for
  return $x$
end function

Algorithm 18 requires $3n - 1$ flops and $3n - 2$ extra memory locations, making it an extremely efficient algorithm. Solving a tridiagonal linear system after the factorization can also be done very efficiently, as shown in Algorithm 19: forward substitution with $L$ is followed by back substitution with $U$.
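Both tridiagonal algorithms translate almost line for line into plain Python. The sketch below uses 0-based lists, so `a` has length $n$ while `b` and `c` have length $n - 1$, and, like Algorithm 18, it assumes that no pivoting is needed (no $u_k$ is zero).

```python
def tridiag_lu(a, b, c):
    """Factor the tridiagonal matrix with diagonal a, superdiagonal b and
    subdiagonal c as A = L U without pivoting. Returns (l, u, v)."""
    n = len(a)
    u = [0.0] * n
    l = [0.0] * (n - 1)
    v = list(b)                      # the superdiagonal of U equals that of A
    u[0] = a[0]
    for k in range(1, n):
        l[k - 1] = c[k - 1] / u[k - 1]
        u[k] = a[k] - l[k - 1] * v[k - 1]
    return l, u, v

def tridiag_solve(l, u, v, y):
    """Solve L U x = y: forward substitution with L, then back substitution with U."""
    n = len(u)
    x = list(y)
    for k in range(1, n):
        x[k] -= l[k - 1] * x[k - 1]
    x[n - 1] /= u[n - 1]
    for k in range(n - 2, -1, -1):
        x[k] = (x[k] - v[k] * x[k + 1]) / u[k]
    return x

# 3 x 3 example with diagonal 2 and off-diagonals 1; the exact solution is (1, 2, 3).
l, u, v = tridiag_lu([2.0, 2.0, 2.0], [1.0, 1.0], [1.0, 1.0])
x = tridiag_solve(l, u, v, [4.0, 8.0, 8.0])
```

Both routines make a single pass over the data, so the cost grows linearly in $n$.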

If we incorporate partial pivoting, then the algorithms will be somewhat more complex. However, the sparsity of the matrix limits the possibilities for pivoting. For example, the first row could only be swapped with the second row; in general, row $k$ could only be swapped with row $k + 1$. These swaps do result in additional non-zero entries. We can show how this happens in an example below; “∗” denotes an original non-zero, while “+” denotes a new non-zero entry created by the algorithm.

[Diagram: sequence of sparsity patterns of a tridiagonal matrix under partial pivoting: swap rows 1 & 2 ⇒ elimination step ⇒ swap rows 2 & 3 ⇒ elimination step ⇒ etc.]


The result is an additional second superdiagonal of potential non-zero entries in the 
U matrix. The L matrix in the LU factorization remains a bidiagonal matrix with 
entries on the main diagonal and the first sub-diagonal. 

QR factorizations of tridiagonal matrices can also be computed, and the resulting sparsity pattern is essentially the same as for LU with partial pivoting: an additional second superdiagonal of non-zero entries is created in the $R$ factor. The $Q$ matrix given explicitly is not sparse, but if it is stored in compact form (such as using the WY form, or just storing the Householder vectors) it can be stored as a sparse matrix with no additional storage of fill-in except for a single vector of length $n$ if the factored matrix is $m \times n$.

Banded matrices can also be factored in a way that preserves their sparsity pat- 
tern using either LU factorization without pivoting or Cholesky factorization. LU 
factorization with partial pivoting and QR factorization both result in fill-in: if the 
bandwidth of the matrix factored is b then the bandwidth of U in the LU factor- 
ization and the R in the QR factorization is increased by b — 1. See [105, Sec. 4.3, 
pp. 152-160] for more details about factoring banded matrices. 


2.3.2 Data Structures for Sparse Matrices 


General sparse matrices require more flexible data structures than we need for dense 
or full matrices. The notes of G.W. Stewart on sparse matrices [239] provide a good 
introduction to many of the issues in creating sparse factorization codes. The most 
common data structures for sparse matrices are Compressed Sparse Row (CSR) 
and Compressed Sparse Column (CSC) data structures. These data structures are 
essentially mirror images of each other; the CSR representation of a matrix A can 
be considered as the CSC representation of $A^T$, and vice versa. CSC format is used
by MATLAB and Julia, while Python (through SciPy's sparse module) supports both.


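Example 2.14 To illustrate how these data structures work, consider the following $5 \times 5$ sparse matrix:

$$\begin{bmatrix} & 7 & & 2 & \\ 3 & 1 & & & 6 \\ & & 4 & 1 & \\ 2 & & & 3 & \\ & 1 & 9 & 5 & 8 \end{bmatrix}$$

The CSR and CSC representations are shown in Figure 2.3.1.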


The rowptr and colptr arrays hold the indexes of the start of the corre- 
sponding row and column, respectively. The colidx and rowidx entries give the 
column (respectively, row) index for that entry, while the value entries give the 
actual matrix entry. 

Algorithm 20 gives pseudo-code for computing y = Ax where A is given by 
either a CSR or CSC data structure. For simplicity, it is assumed that the final entry 
in both the rowptr and colptr arrays is the number of non-zeros in the matrix 
plus one. 

The CSR and CSC matrix–vector multiplication algorithms are roughly equally efficient: the CSR algorithm can hold $y_i$ in a register while accesses to the $x$ vector may be widely scattered; the CSC algorithm can hold $x_j$ in a register while accesses to the $y$ vector may be widely scattered.

Compressed sparse row (CSR)

rowptr | 1 3 6 8 10
colidx | 2 4 | 1 2 5 | 3 4 | 1 4 | 2 3 4 5
value  | 7 2 | 3 1 6 | 4 1 | 2 3 | 1 9 5 8

Compressed sparse column (CSC)

colptr | 1 3 6 8 12
rowidx | 2 4 | 1 2 5 | 3 5 | 1 3 4 5 | 2 5
value  | 3 2 | 7 1 1 | 4 9 | 2 1 3 5 | 6 8

Fig. 2.3.1 CSR and CSC sparse matrix data structures


Algorithm 20 Computing $y \leftarrow y + Ax$ using CSR and CSC data structures

function spmultCSR($A, x, y$)
  for $i = 1, 2, \dots, m$
    for $k$ = A.rowptr$_i$, ..., A.rowptr$_{i+1} - 1$
      $j \leftarrow$ A.colidx$_k$
      $y_i \leftarrow y_i +$ A.value$_k \cdot x_j$
    end
  end
  return $y$
end

function spmultCSC($A, x, y$)
  for $j = 1, 2, \dots, n$
    for $k$ = A.colptr$_j$, ..., A.colptr$_{j+1} - 1$
      $i \leftarrow$ A.rowidx$_k$
      $y_i \leftarrow y_i +$ A.value$_k \cdot x_j$
    end
  end
  return $y$
end

Other ways of representing a sparse matrix include using a set of $(i, j, a_{ij})$ triples, or using a dictionary or hash table to look up values $a_{ij}$ from a pair $(i, j)$. From the point of view of memory access, CSR and CSC data structures are the most efficient for most matrix operations. Memory access in the matrix algorithm should be adapted to being along columns for CSC sparse matrices, and along rows for CSR sparse matrices. Dictionary or hash table representations usually are not so efficient regarding memory access as they require pseudo-random memory accesses. However, both the set-of-triples $(i, j, a_{ij})$ and dictionary representations can be much more efficient about inserting new non-zero entries than either CSR or CSC structures. An efficient way to create a sparse matrix in CSC or CSR format is to create a list or array of $(i, j, a_{ij})$ triples, and sort the triples lexicographically according to either the $(i, j)$ or $(j, i)$ pairs according to whether CSR or CSC format is desired. Then build the sparse matrix sequentially from the sorted list of triples. Sorting the triples ensures that no new entry has to be inserted between already existing entries, only appended to the end of the value and rowidx or colidx arrays.
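The sort-then-append construction can be sketched as follows for the CSR case. This is a minimal plain-Python version with 0-based indices; it assumes that each position $(i, j)$ appears at most once in the list of triples, and the function name is illustrative.

```python
def triples_to_csr(triples, m):
    """Build 0-based CSR arrays (rowptr, colidx, value) from (i, j, a_ij) triples
    for a matrix with m rows. Assumes each (i, j) appears at most once."""
    rowptr = [0] * (m + 1)
    colidx, value = [], []
    for i, j, aij in sorted(triples):      # lexicographic order on (i, j)
        colidx.append(j)                   # entries are only ever appended
        value.append(aij)
        rowptr[i + 1] += 1                 # count the entries in row i
    for i in range(m):                     # prefix sums turn counts into row starts
        rowptr[i + 1] += rowptr[i]
    return rowptr, colidx, value

# The matrix of Example 2.14, listed column by column so that sorting matters.
triples = [(1, 0, 3), (3, 0, 2), (0, 1, 7), (1, 1, 1), (4, 1, 1), (2, 2, 4),
           (4, 2, 9), (0, 3, 2), (2, 3, 1), (3, 3, 3), (4, 3, 5), (1, 4, 6), (4, 4, 8)]
rowptr, colidx, value = triples_to_csr(triples, 5)
```

Sorting by $(j, i)$ instead and counting per column would give the CSC arrays in the same way.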


2.3.3. Graph Models of Sparse Factorization 


In this section, we will focus more on the case of symmetric matrices and Cholesky factorization. We can represent the sparsity structure of a matrix by means of a graph or network. The nodes or vertices of the graph of an $n \times n$ symmetric matrix $A$ are simply the integers $1, 2, \dots, n$. There is an edge from $i$ to $j$ (denoted $i \to j$) in the graph of $A$ (denoted $G_A$) if and only if $a_{ij} \neq 0$. For a symmetric matrix, we have edges in both directions ($i \to j$ and $j \to i$) if $a_{ij} = a_{ji} \neq 0$, so we consider the edges as being undirected (denoted $i \sim j$), although we ignore edges connecting a node to itself. The graph consists of both its vertices and edges, denoted $G = (V, E)$ where $V$ is the set of vertices and $E$ is the set of edges. For any vertex $x \in V$ of $G$, its set of neighbors is $N(x; G) = \{y \in V \mid x \sim_G y\}$ where $x \sim_G y$ means that there is an edge connecting $x$ and $y$ in graph $G$.

The graph of a tridiagonal matrix is simply a path: $1 \sim 2 \sim 3 \sim \cdots \sim n-1 \sim n$. The graph of an $n \times n$ diagonal matrix consists of $n$ vertices and no edges. An $n \times n$ full or dense matrix, with all entries non-zero, has a graph called the complete graph where for any nodes $i \neq j$ there is an edge $i \sim j$. This graph is denoted $K_n$. We can think of $K_3$ as a triangle, and $K_4$ as a square plus both diagonals.

The ordering of the rows and columns can be of immense importance regarding the 
efficiency of using Cholesky factorization for solving systems of equations. Consider 
the sparsity structure in Figure 2.3.2. The graph on the left is a spokes graph, which 
(depending on the ordering) can represent either of the two arrowhead matrices on 
the right. The first of these matrices where the “tip” of the arrow is at the bottom-right 
corner suffers from no fill-in if we use Cholesky factorization, or LU factorization 
with no pivoting. The second of these matrices, where the “tip” of the arrow is at the 
top-left corner, results in complete fill-in after just the first stage of the factorization. 

We want to use the graph model to understand how to order the rows and columns 
to minimize fill-in. To do this, consider carrying out the first stage of Cholesky 
factorization, but with a specified node in the graph listed as being the “first”. Recall 
the recursive formulation of the Cholesky factorization: 
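
$$\begin{bmatrix} \alpha & a^T \\ a & \widehat{A} \end{bmatrix}
= \begin{bmatrix} \sqrt{\alpha} & \\ a/\sqrt{\alpha} & I \end{bmatrix}
\begin{bmatrix} 1 & \\ & \widehat{A} - a a^T/\alpha \end{bmatrix}
\begin{bmatrix} \sqrt{\alpha} & a^T/\sqrt{\alpha} \\ & I \end{bmatrix}.$$

Here $\alpha = a_{11}$, $a$ is the remainder of the first column of $A$, and $\widehat{A}$ is the 
trailing $(n-1) \times (n-1)$ block of $A$. 

The graph of $\widehat{A} - aa^T/\alpha$ is the graph of $A$ with the first node and all the edges 
from the first node removed, but with a number of edges added coming from $aa^T/\alpha$. 
Now $(aa^T/\alpha)_{ij} = a_{i1} a_{j1}/a_{11} \neq 0$ if $a_{i1} \neq 0$ and $a_{j1} \neq 0$. Ignoring the 
possibility of “accidentally” zeroing out an entry of the matrix (where 
$a_{ij} = a_{i1} a_{j1}/a_{11}$, coincidentally), we get new edges $i \sim j$ in the graph of 
$\widehat{A} - aa^T/\alpha$ where $1 \sim i$ and $1 \sim j$. That is, the graph of $\widehat{A} - aa^T/\alpha$ is the 
graph of $A$ with node 1 removed and edges $i \sim j$ added whenever $i \sim 1 \sim j$ 
in the graph of $A$. 

Fig. 2.3.2. Spokes graph and arrowhead matrix: (A) Spokes graph, (B) Arrowhead matrices 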

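The task now is: given the graph $G_A$ of an $n \times n$ matrix $A$, we want to choose 
which vertex $x$ of $G_A$ to eliminate. When we eliminate vertex $x$, we remove $x$ and 
all its edges from $G_A$ and add edges between each pair of neighbors of $x$ in $G_A$. We 
can describe this operation more formally as 

$$G' = \bigl( V \setminus \{x\},\ \bigl(E \setminus \{\, x \sim y \mid y \in N(x; G) \,\}\bigr) \cup \{\, i \sim j \mid i, j \in N(x; G),\ i \neq j \,\} \bigr).$$

This gives us the graph of the reduced matrix $A' := \widehat{A} - aa^T/\alpha$ (with $\widehat{A}$ the 
trailing block of $A$, $a$ the rest of its first column, and $\alpha = a_{11}$), which is used 
recursively to compute the remainder of the Cholesky factorization. So we want to find an 
ordering of the vertices of $G_A$ so that these elimination operations add as few edges as 
possible. 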
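
The elimination operation can be sketched in a few lines of Python. This is a minimal illustration under assumptions of our own: the graph is stored as a dictionary of neighbor sets, and the names `eliminate` and `fill_in` are hypothetical helpers, not part of any library. Accidental cancellation is ignored, as in the discussion above.

```python
from itertools import combinations

def eliminate(adj, x):
    """Eliminate vertex x from the graph in place; return the fill edges added.

    adj maps each vertex to the set of its neighbors (undirected, no self-loops).
    """
    nbrs = adj.pop(x)
    for y in nbrs:
        adj[y].discard(x)              # remove x and all of its edges
    fill = []
    for i, j in combinations(sorted(nbrs), 2):
        if j not in adj[i]:            # join each pair of neighbors of x
            adj[i].add(j)
            adj[j].add(i)
            fill.append((i, j))
    return fill

def fill_in(edges, order):
    """Count the edges added when the vertices are eliminated in the given order."""
    adj = {v: set() for v in order}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    return sum(len(eliminate(adj, x)) for x in order)

# Spokes graph: hub 0 joined to the spoke vertices 1, ..., 5.
spokes = [(0, k) for k in range(1, 6)]
hub_first = fill_in(spokes, [0, 1, 2, 3, 4, 5])
hub_last = fill_in(spokes, [1, 2, 3, 4, 5, 0])
print(hub_first, hub_last)
```

Eliminating the hub first joins every pair of spoke vertices, while eliminating it last adds no edges, which matches the two arrowhead matrices discussed earlier.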

The path graph $1 \sim 2 \sim 3 \sim \cdots \sim n$ can be eliminated in the natural ordering 
without producing fill-in. At stage $k$, the graph is $k \sim k+1 \sim k+2 \sim \cdots \sim n$. 
Eliminating vertex $k$ does not result in any fill-in, because the only neighbor of $k$ is 
$k+1$, so there is no pair of neighbors to join. 

There are many graphs with the property that there is a fill-in free elimination 
ordering. These include banded matrices, for example. The graph of a general banded 
$n \times n$ matrix with bandwidth $b$ has vertices $1, 2, \ldots, n$ and edges $i \sim j$ if and only if 
$0 < |i - j| \le b$. At the first stage of the factorization, vertex 1 is connected to vertices 2, 
3, ..., $b + 1$. Elimination of vertex 1 connects vertices 2, 3, ..., $b + 1$ to each other. 
Algorithm 21 Cuthill-McKee algorithm 

    function cuthillmckee(G, x) 
        q ← queue()                  // create empty queue 
        ℓ ← list()                   // create empty list 
        add(ℓ, x); add(q, x)         // add x to both ℓ and q 
        while q not empty 
            v ← remove(q)            // v was first entry in q 
            N ← {y | y ~ v is an edge of G and y is not in ℓ} sorted by deg(y) 
            for each y ∈ N in order 
                add(ℓ, y); add(q, y) 
            end for 
        end while 
        return ℓ 
    end function 


But they were already connected to each other! So there are no new edges. Repeating 
the process, by the time we get to stage $k$ of the factorization, vertex $k$ is connected 
to vertices $k+1, k+2, \ldots, k+b$. But again, they were already connected to each 
other, so there are again no new edges. Thus, we can complete the Cholesky 
factorization, or LU factorization without pivoting, of a banded matrix without incurring 
any additional fill-in. 

We can use these insights to see how to factor an arrowhead matrix by looking 
at the spokes graph shown in Figure 2.3.2. Since every vertex of the spokes graph, 
except the “hub” vertex, is equivalent from the point of view of graph theory, the 
choice of which vertex to eliminate first comes down to whether to eliminate a 
“spoke” vertex or the “hub” vertex. Eliminating the hub vertex immediately results 
in edges between all pairs of the other vertices, which is complete fill-in. However, 
eliminating a spoke vertex does not result in any fill-in since the only neighbor of a 
spoke vertex is the hub vertex. Thus, a no fill-in ordering will put the hub vertex last. 
The ordering of the spoke vertices is arbitrary. 


2.3.3.1 Heuristics: Reverse Cuthill-McKee 


The Cuthill-McKee ordering [66, 101] is a method for reducing the bandwidth of 
a sparse matrix: the bandwidth of $A$ is $1 + 2\max\{\, |i - j| : a_{ij} \neq 0 \,\}$. Tridiagonal 
matrices, for example, have bandwidth three. An $n \times n$ matrix of bandwidth $b \ge 1$ 
has an LU factorization that also has bandwidth $b$ provided there is no pivoting. This 
is useful in limiting the amount of fill-in, as a matrix with bandwidth $b$ has 
no more than $b$ entries in each row or column. The Cuthill-McKee ordering is based 
on breadth-first search [59] for graphs, a method of traversing a graph. Breadth-first 
search uses a queue: a data structure where items are added to the back 
of the queue and later removed from the front of the queue. The order 
of items in the queue is fixed according to the order in which they are added to the 
queue. The Cuthill-McKee algorithm is shown in Algorithm 21. 
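
As a rough illustration, the breadth-first traversal can be written in Python as follows. This is only a sketch: the adjacency-dictionary representation, the tie-breaking by vertex label, and the helper `half_bandwidth` (which returns $\max |i - j|$ over the edges under a given ordering) are assumptions made here for the example.

```python
from collections import deque

def cuthill_mckee(adj, start):
    """Breadth-first ordering from start, taking new neighbors by increasing degree.

    adj maps each vertex to the set of its neighbors. Only the connected
    component containing start is ordered.
    """
    order = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()                        # front of the queue
        new = sorted((y for y in adj[v] if y not in seen),
                     key=lambda y: (len(adj[y]), y))
        for y in new:
            seen.add(y)
            order.append(y)
            queue.append(y)                        # back of the queue
    return order

def half_bandwidth(adj, order):
    """Largest distance |pos(i) - pos(j)| over all edges i ~ j under the ordering."""
    pos = {v: k for k, v in enumerate(order)}
    return max((abs(pos[i] - pos[j]) for i in adj for j in adj[i]), default=0)

# A path graph whose vertices carry scrambled labels: 3 ~ 0 ~ 4 ~ 1 ~ 2.
adj = {3: {0}, 0: {3, 4}, 4: {0, 1}, 1: {4, 2}, 2: {1}}
cm = cuthill_mckee(adj, 3)
rcm = cm[::-1]                                     # reverse Cuthill-McKee ordering
print(cm, half_bandwidth(adj, [0, 1, 2, 3, 4]), half_bandwidth(adj, cm))
```

Starting from an end of the path, the ordering recovers the path order, so the re-ordered matrix is tridiagonal. The choice of starting vertex matters in general; this sketch simply takes it as an argument.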
While the Cuthill-McKee ordering works well for minimizing the bandwidth, it 
turns out that reversing the Cuthill-McKee ordering actually is better for reducing 
fill-in and floating point operation counts. The reason for the benefits of using the reverse 
Cuthill-McKee ordering over the original Cuthill-McKee ordering is explained in 
[165]. 

As an example for comparing the forward and reverse Cuthill-McKee orderings, 
consider the spokes graphs as illustrated in Figure 2.3.2, where there is a single hub 
vertex and all other vertices are connected to the hub vertex, but to no other 
vertex. It should be noted that no ordering of a spokes graph gives small bandwidth. The 
Cuthill-McKee ordering starting with a non-hub vertex always has the hub vertex 
as the second vertex. All other non-hub vertices follow. The resulting Cholesky 
factorization using the Cuthill-McKee ordering then completely fills in except for the first 
row and column. On the other hand, in the reverse Cuthill-McKee ordering, the hub 
vertex is the second-last vertex. Applying the Cholesky factorization to the re-ordered 
matrix using the reverse Cuthill-McKee ordering for the spokes graph results in no 
fill-in. 

In fact, Liu and Sherman [165] showed that the reverse Cuthill-McKee ordering 
is never worse than the forward Cuthill-McKee ordering for either the number of 
floating point operations or the amount of fill-in. 


2.3.3.2 Heuristics: Nested Dissection 


The main idea behind nested dissection is to split the graph of the matrix into two 
independent parts with a “separator” set being the “glue” between the parts. The idea 
is illustrated in Figure 2.3.3. This approach is developed in [101], for example. 

This means that the vertex set of the original graph $G$ is split as $V = V_1 \cup V_2 \cup S$ 
where $V_1$, $V_2$, and $S$ are pairwise disjoint. Furthermore, no edge $x \sim y$ in $G$ is 
between vertices $x$ and $y$ where $x \in V_1$ and $y \in V_2$ or vice versa. 

The subgraph $G_1$ consists of the vertices $V_1$ and all edges of $G$ between vertices 
in $V_1$; subgraph $G_2$ consists of the vertices $V_2$ and all edges of $G$ between vertices 
in $V_2$. 

If we have a symmetric sparse matrix $A$ whose graph $G_A$ can be split in this way, and 
we order the rows and columns of $A$ so that the rows (and columns) corresponding 
to $V_1$ come first, followed by the rows (and columns) corresponding to $V_2$, followed finally 
by the rows (and columns) corresponding to $S$, then the matrix will have the structure 

$$P^T A P = \begin{bmatrix} A_1 & & B_1^T \\ & A_2 & B_2^T \\ B_1 & B_2 & A_S \end{bmatrix}.$$

We can think of applying Cholesky factorization in a block fashion to this matrix: 
$P^T A P = L L^T$ where 

$$L = \begin{bmatrix} L_1 & & \\ & L_2 & \\ M_1 & M_2 & L_S \end{bmatrix} \quad\text{where}$$

$$\begin{aligned}
L_1 L_1^T &= A_1, && \text{(Cholesky factorization)} \\
L_2 L_2^T &= A_2, && \text{(Cholesky factorization)} \\
M_1 &= B_1 L_1^{-T}, && \text{(forward substitution)} \\
M_2 &= B_2 L_2^{-T}, && \text{(forward substitution)} \\
L_S L_S^T &= A_S - M_1 M_1^T - M_2 M_2^T && \text{(Cholesky factorization).}
\end{aligned}$$


The first two Cholesky factorizations for $A_1$ and $A_2$ can be done independently. 
Furthermore, we can apply the splitting approach recursively to $A_1$ and $A_2$ by further 
splitting graphs $G_1$ and $G_2$, which would have their own separator sets. This can be 
repeated recursively until the number of vertices is too small for this to be helpful. 
The Cholesky factor of $A_S - M_1 M_1^T - M_2 M_2^T$ is typically dense. The reason 
for this is that the graph $G_S$ of $A_S - M_1 M_1^T - M_2 M_2^T$ has an edge $x \sim y$ for $x, y \in S$ 
if and only if there is a path from $x$ to $y$ through either $G_1$ or $G_2$, but not both. If 
either $G_1$ or $G_2$ is a connected graph (which is usually the case), then $G_S$ will be a 
complete graph and $L_S$ will be dense. 

We can analyze this approach to estimate the number of floating point operations 
needed to compute the sparse Cholesky factorization, and to estimate the fill-in 
needed for the Cholesky factors. To get these bounds, we assume that $B_1$ and $B_2$ are 
dense. We can use the recursive decomposition as follows: 
Fig. 2.3.4 Nested dissection ordering for a 7 x 7 rectangular grid (legend: level-1, level-2, level-3, and level-4 separators) 


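$$\begin{aligned}
\mathrm{flops}(G) &= \mathrm{flops}(G_1) + \mathrm{flops}(G_2) 
 + \bigl(\text{fill-in}(G_1) + \text{fill-in}(G_2)\bigr)\,|S| \\
&\qquad + \bigl(|V_1| + |V_2|\bigr)\,|S|^2 + \tfrac{1}{3}\,|S|^3, \\
\text{fill-in}(G) &= \text{fill-in}(G_1) + \text{fill-in}(G_2) 
 + \bigl(|V_1| + |V_2| + |S|\bigr)\,|S|.
\end{aligned}$$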


A good splitting has $|V_1| \approx |V_2| \gg |S|$. 

To turn these values into concrete estimates, we need to estimate the size of $|S|$. 
For planar graphs $G$, we can bound $|S| = O(\sqrt{n})$ [164]. 

This is particularly easy to implement on, for example, a two-dimensional rect- 
angular grid. For an M x N rectangular grid, we take S to be either a vertical line 
or a horizontal line splitting the grid into two roughly equal parts; whether S is ver- 
tical or horizontal depends on whether M > N or M < N. If M=N then S can 
be either vertical or horizontal. An example of a 7 x 7 rectangular grid is shown in 
Figure 2.3.4. The vertical solid box shows the separator set S at the top level, dashed 
for the separator sets at the next level, then dot-dash, and finally solid again at the 
lowest level. 
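
The recursive splitting of a rectangular grid can be sketched as follows. This is an illustrative implementation under assumptions of our own: grid points are labeled by `(row, column)` pairs, the separator is the middle line of the longer side, and regions with at most two points are simply listed. The helper `grid_fill_in` counts fill edges with the graph elimination model of Section 2.3.3.

```python
from itertools import combinations

def nested_dissection(rows, cols):
    """Order the grid points (i, j), i in rows, j in cols: V1, then V2, then S."""
    if len(rows) * len(cols) <= 2:
        return [(i, j) for i in rows for j in cols]
    if len(cols) >= len(rows):                     # cut across the longer side
        m = len(cols) // 2
        sep = [(i, cols[m]) for i in rows]         # vertical separator
        part1 = nested_dissection(rows, cols[:m])
        part2 = nested_dissection(rows, cols[m + 1:])
    else:
        m = len(rows) // 2
        sep = [(rows[m], j) for j in cols]         # horizontal separator
        part1 = nested_dissection(rows[:m], cols)
        part2 = nested_dissection(rows[m + 1:], cols)
    return part1 + part2 + sep

def grid_fill_in(n_rows, n_cols, order):
    """Count fill edges when the grid graph is eliminated in the given order."""
    adj = {(i, j): set() for i in range(n_rows) for j in range(n_cols)}
    for (i, j) in adj:
        for nb in ((i + 1, j), (i, j + 1)):
            if nb in adj:
                adj[(i, j)].add(nb)
                adj[nb].add((i, j))
    fill = 0
    for x in order:
        nbrs = adj.pop(x)
        for y in nbrs:
            adj[y].discard(x)
        for u, v in combinations(nbrs, 2):         # join the neighbors of x
            if v not in adj[u]:
                adj[u].add(v)
                adj[v].add(u)
                fill += 1
    return fill

N = 7
natural = [(i, j) for i in range(N) for j in range(N)]
nd = nested_dissection(list(range(N)), list(range(N)))
print(grid_fill_in(N, N, natural), grid_fill_in(N, N, nd))
```

The two printed numbers are the fill-in for the natural (row by row) ordering and for the nested dissection ordering of the same grid. Changing `N` gives a quick way to compare how the two counts grow.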

Nested dissection often produces near-optimal orderings [1]. For an $N \times N$ 
rectangular grid, the factorization can be performed in $O(N^3)$ floating point operations 
and $O(N^2 \log N)$ fill-in. By contrast, the standard ordering by row or by column 
results in $O(N^4)$ floating point operations and $O(N^3)$ fill-in. For three dimensions, 
nested dissection still gives good results for direct methods, but the cost increases 
more rapidly in $N$ and iterative methods become increasingly advantageous. 
2.3.4 Unsymmetric Factorizations 


Graph models can be developed for unsymmetric factorizations, such as LU 
factorizations, both with and without partial pivoting, and QR factorization. Rather than 
using undirected graphs with the set of vertices $\{1, 2, 3, \ldots, n\}$, we can use directed 
graphs where there is an edge $i \to j$ if $a_{ij} \neq 0$, or bipartite graphs with vertices 
$\{r_1, r_2, \ldots, r_m\} \cup \{c_1, c_2, \ldots, c_n\}$ with edges $r_i \sim c_j$ if $a_{ij} \neq 0$. A graph is bipartite 
if the vertex set $V = V_1 \cup V_2$ with $V_1 \cap V_2 = \emptyset$ and every edge is undirected but joins 
a vertex in $V_1$ with a vertex in $V_2$. In a bipartite graph representing an unsymmetric 
matrix, $V_1$ can be taken to represent the rows, while $V_2$ can be taken to represent the 
columns. 

Here we focus on the QR factorization and the bipartite graph model: 

$$V = \{r_1, r_2, \ldots, r_m\} \cup \{c_1, c_2, \ldots, c_n\}$$

with edges $r_i \sim_A c_j$ if $a_{ij} \neq 0$. 


Let $N_A(x) = \{\, y \mid x \sim_A y \,\}$ be the set of neighbors of $x$ in $G_A$. If $A = QR$, then 
let $B = A^T A = R^T Q^T Q R = R^T R$ with $R$ upper triangular. But $B = A^T A$ is 
symmetric and we can use the undirected graph model of $B$: the vertices of $G_B$ are the 
column vertices of $A$ ($V_B = \{c_1, c_2, \ldots, c_n\}$), and $c_i \sim_B c_j$ if $b_{ij} \neq 0$. Unless there 
are “accidental” zeros, $b_{ij} \neq 0$ if $N_A(c_i) \cap N_A(c_j) \neq \emptyset$. We can then use the 
elimination model for Cholesky factorization of $B$ to determine how to order the columns 
of $B$, and thus to order the columns of $A$. The rows of $A$ should then be ordered 
so that $r_k \sim_A c_k$, to avoid unnecessary fill-in at the diagonal entry $(k, k)$ in the QR 
factorization of $A$. 
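
The graph $G_B$ can be built directly from the sparsity pattern of $A$ without forming $A^T A$. The following is a minimal sketch; the function name `column_graph` and the representation of the pattern as a collection of `(row, column)` pairs are assumptions made for this illustration, and accidental zeros are ignored.

```python
def column_graph(pattern):
    """Graph of B = A^T A from the sparsity pattern of A.

    pattern is a collection of (row, column) index pairs of the non-zeros of A.
    Returns a dict mapping each column to the set of columns adjacent to it.
    """
    cols_in_row = {}
    for r, c in pattern:
        cols_in_row.setdefault(r, set()).add(c)
    graph = {c: set() for _, c in pattern}
    for cols in cols_in_row.values():
        for c in cols:
            graph[c] |= cols - {c}       # columns sharing a row are adjacent
    return graph

# Upper bidiagonal 3 x 3 pattern: columns 0 ~ 1 and 1 ~ 2, but not 0 ~ 2.
bidiag = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]
print(column_graph(bidiag))

# A single dense row makes the graph of A^T A complete.
dense_row = [(0, 0), (0, 1), (0, 2), (1, 1), (2, 2)]
print(column_graph(dense_row))
```

The second example shows why a single dense row of $A$ is costly in this model: every pair of columns shares that row, so $G_B$ is a complete graph.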

The sparsity structure of the LU factorization of $A$ without pivoting is a subset of 
the sparsity structure of the QR factorization of $A$. If we allow for all possible row 
swaps in partial pivoting, then the union of the possible sparsity 
structures of LU factorizations of $A$ is the sparsity structure of the QR factorization. 
This means that the sparsity structure of the QR factorization can be pre-computed, 
and this sparsity structure can be used for LU factorization with partial pivoting, 
without the need for any additional fill-in. 


Exercises. 


(1) Suppose we add a single symmetric entry to the top-right and bottom-left cor- 
ners of a symmetric tridiagonal matrix. What is the graph of the resulting 
matrix? How much fill-in will be generated by Cholesky factorization? Assume 
that the resulting matrix is positive definite so that the Cholesky factorization 
exists. 

(2) The graph of an arrowhead matrix is a spokes graph (Figure 2.3.2). If we 
have a symmetric matrix that is the sum of a tridiagonal matrix and a matrix 
$u e_1^T + e_1 u^T$ that fills in the first row and column, show that the corresponding 
graph is a wheel graph: a cycle of $n - 1$ vertices together with a “hub vertex” 
that is connected to every node in the cycle. Show that there is an ordering that 
gives no fill-in for this graph. 

(3) Show that if the graph of a symmetric matrix is a tree (a connected undirected 
graph with no cycles) then the matrix can be re-ordered so that Cholesky 
factorization gives no fill-in. 

(4) Show that for a symmetric banded matrix ($a_{ij} \neq 0$ only if $|i - j| \le b$ where 
$b$ is the “bandwidth”) the Cholesky factorization can be done with no fill-in 
outside the band. Note that a tridiagonal matrix is the special case of a banded 
matrix with bandwidth $b = 1$. 

(5) The standard discretization of the Laplacian operator in two dimensions 
is $(-\nabla^2 u)_{ij} = (4u_{i,j} - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1})/(\Delta x)^2$. Create an 
$N^2 \times N^2$ matrix using the natural ordering which assigns point $(i, j)$ on the grid 
to index $N(i - 1) + j$ for $1 \le i, j \le N$. The corresponding graph is the grid 
graph that connects $(i, j)$ to $(i + 1, j)$ and $(i, j + 1)$. Generate these matrices 
for $N = 5, 10, 20, 40, 100, 200, 400, 1000$. How much fill-in is generated 
by Cholesky factorization? Apply the reverse Cuthill-McKee reordering 
algorithm to these matrices. How much fill-in is now generated? 

(6) For the matrices in Exercise 5, use the nested dissection algorithm of 
Section 2.3.3.2 to minimize fill-in. How much fill-in is generated by Cholesky 
factorization? Plot the amount of fill-in against $N$. 

(7) Graphs can generate matrices. Specifically, given an undirected graph $G$ with 
vertex set $V$ and edge set $E$, we define the graph Laplacian $L_G$ to be a $V \times V$ 
matrix where $(L_G)_{ii}$ is the degree of node $i$ in $G$, with $(L_G)_{ij} = -1$ if $i \sim j$ 
is an edge in $G$ and $(L_G)_{ij} = 0$ otherwise. Show that: 

(a) the graph of $L_G$ is $G$; 

(b) $L_G$ is a symmetric positive semi-definite matrix [Hint: 
$z^T L_G z = \tfrac{1}{2} \sum_{i,j:\, i \sim j} (z_i - z_j)^2$, where the sum counts each edge in both directions]; 

(c) the null space of $L_G$ is the set of vectors that are constant on each connected 
component of $G$; 

(d) $(\alpha I + L_G)^{-1}$ is a matrix with only non-negative entries for any $\alpha > 0$. 
[Hint: $\alpha I + L_G$ is strictly diagonally dominant with positive diagonal 
entries. Write $\alpha I + L_G = D - F$ where $D$ is the diagonal part of 
$\alpha I + L_G$, and $F$ has non-negative entries. Use the power series formula 
for $(I - D^{-1}F)^{-1}$.] 


(8) Suppose that $A$ is a symmetric strictly diagonally dominant tridiagonal matrix 
($|a_{ii}| > |a_{i,i-1}| + |a_{i,i+1}|$ for all $i$). Show that the magnitudes of the entries 
of $A^{-1}D$ ($D$ is the diagonal part of $A$) decrease exponentially rapidly away 
from the main diagonal. [Hint: Let $\alpha = \max_i (|a_{i,i-1}| + |a_{i,i+1}|)/|a_{ii}| < 1$. 
Writing $A = D - F$, $A^{-1}D = (I - D^{-1}F)^{-1} = I + D^{-1}F + (D^{-1}F)^2 + \cdots$ 
with $\|D^{-1}F\|_\infty = \alpha < 1$.] 

(9) A standard procedure for eigenvalue/vector algorithms for a real symmetric 
matrix $A$ is to put a matrix into a certain sparse form by means of Givens’ 
rotations: $A \leftarrow G_{ij}(c, s)^T A\, G_{ij}(c, s)$ where $G_{ij}(c, s)$ is the identity matrix except 
for rows $i, j$ and columns $i, j$. Here $G_{ij}(c, s)_{ii} = G_{ij}(c, s)_{jj} = c$ while 
$G_{ij}(c, s)_{ji} = -G_{ij}(c, s)_{ij} = s$ with $c^2 + s^2 = 1$. Given $(i, j)$ where $a_{ij} \neq 0$, 
we can find $(c, s)$ so that the $(i, j)$ entry of $G_{ij}(c, s)^T A\, G_{ij}(c, s)$ is set to zero. 
Show that unless there is “accidental” cancellation, this operation can be 
represented by the following operation on the graph of $A$: for every $k$ where $i \sim k$, 
add an edge $j \sim k$; for every $k$ where $j \sim k$, add an edge $i \sim k$; delete the edge 
$i \sim j$. This graph operation is introduced in [238]. 

(10) Read about SuperLU [75], a high-performance super-nodal method for carrying 
out LU factorization with partial pivoting. Outline how the algorithm improves 
performance over naive implementations of sparse LU factorization with partial 
pivoting. 


2.4 Iterations 


The methods we have seen so far are direct methods: they give answers which would 
be exactly correct if done in exact arithmetic. But for large systems of equations, 
if the accuracy requirements are not too stringent, using iterative methods can give 
good results far more cheaply in terms of time and memory. Some iterative methods 
were developed in the nineteenth and early twentieth centuries. The 1950s saw the 
creation of the conjugate gradient method, which differed from most of the previous 
methods by not requiring detailed knowledge of, or manipulation of, the entries of 
the matrix. Instead, the conjugate gradient method just required the computation of 
matrix-vector products, short linear combinations, and inner products. In the 1970s 
through 1990s a number of new methods of this type were developed. These methods 
are called Krylov subspace methods. 

The two approaches of classical and Krylov subspace methods are not exclusive. 
In fact, classical iterations are very useful as preconditioners for Krylov subspace 
methods. 


2.4.1 Classical Iterations 


The classical iterations and their variants are based on matrix splittings: to solve 
Ax = b for x we split A = M — N where M is invertible. The equation Ax = b 
can then be put into the form Mx = Nx + b. Then we can create the fixed-point 
iteration 


$$x_{k+1} \leftarrow M^{-1}(N x_k + b), \qquad k = 0, 1, 2, \ldots. \tag{2.4.1}$$

Classical iterations include the Jacobi or Gauss-Jacobi iteration: choose $M = D$, 
the diagonal part of $A$, and $N = D - A$. One iteration can be implemented as 
Algorithm 22. The iteration matrix is $G_J = I - D^{-1}A$. The iteration is 

$$x_{k+1} \leftarrow D^{-1}((D - A)x_k + b). \tag{2.4.2}$$

Algorithm 22 Jacobi iteration for Ax = b 

    function jacobi(A, x, b) 
        for i = 1, 2, ..., n 
            s ← b_i 
            for j = 1, 2, ..., n 
                if j ≠ i 
                    s ← s − a_ij x_j 
                end if 
            end for 
            y_i ← s / a_ii 
        end for 
        x ← y 
        return x 
    end function 

Algorithm 23 Gauss-Seidel iteration for Ax = b 

    function gauss_seidel(A, x, b) 
        for i = 1, 2, ..., n 
            s ← b_i 
            for j = 1, 2, ..., n 
                if j ≠ i 
                    s ← s − a_ij x_j 
                end if 
            end for 
            x_i ← s / a_ii 
        end for 
        return x 
    end function 


The Jacobi method is more effective for sparse matrices, as one iteration of the 
Jacobi method uses $\le 2\,\mathrm{nnz}(A)$ floating point 
operations, where $\mathrm{nnz}(A)$ is the number of non-zero entries of $A$. 

Another classical iteration is the Gauss-Seidel iteration. Now we choose $M = 
D + L$ where $D$ is the diagonal part of $A$ and $L$ the strictly lower triangular part of $A$, 
so that $N = M - A = -U$ where $U$ is the strictly upper triangular part of $A$. This is 
often preferred over the 
Jacobi iteration as it is not necessary to keep a separate vector to store the result, as 
shown in Algorithm 23. The iteration can be represented in matrix-vector form as 

$$x_{k+1} \leftarrow (D + L)^{-1}[\,b - U x_k\,]. \tag{2.4.3}$$

The iteration matrix is $G_{GS} = -(D + L)^{-1}U$. The number of floating point 
operations for Gauss-Seidel per iteration is the same as for the Jacobi method. 
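
The two sweeps can be sketched in Python for a sparse matrix. The storage scheme used here (each row as a dictionary `{j: a_ij}` of its non-zero entries) and the function names are assumptions made for this illustration; they are chosen so that one sweep touches each non-zero entry once.

```python
def jacobi_sweep(rows, x, b):
    """One Jacobi iteration; rows[i] is a dict {j: a_ij} of the non-zeros in row i."""
    y = [0.0] * len(x)
    for i, row in enumerate(rows):
        s = b[i] - sum(a * x[j] for j, a in row.items() if j != i)
        y[i] = s / row[i]
    return y                          # a separate result vector y is needed

def gauss_seidel_sweep(rows, x, b):
    """One Gauss-Seidel iteration; x is overwritten in place."""
    for i, row in enumerate(rows):
        s = b[i] - sum(a * x[j] for j, a in row.items() if j != i)
        x[i] = s / row[i]             # the new value is used by later rows
    return x

# Strictly diagonally dominant tridiagonal example with exact solution (1, 1, 1).
rows = [{0: 4.0, 1: -1.0}, {0: -1.0, 1: 4.0, 2: -1.0}, {1: -1.0, 2: 4.0}]
b = [3.0, 2.0, 3.0]
xj = [0.0, 0.0, 0.0]
xg = [0.0, 0.0, 0.0]
for _ in range(50):
    xj = jacobi_sweep(rows, xj, b)
    xg = gauss_seidel_sweep(rows, xg, b)
print(xj, xg)
```

The only difference between the two functions is where the result is stored, which is exactly the difference between Algorithms 22 and 23.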

The successive over-relaxation (SOR) iteration is a variation of this which has 
an over-relaxation parameter $\omega > 1$, and pseudo-code for one iteration is shown in 
Algorithm 24. Using $\omega = 1$ is equivalent to the Gauss-Seidel method. The SOR 
iteration can be represented in matrix-vector form as 

$$x_{k+1} \leftarrow (D + \omega L)^{-1}\bigl[\,\omega b - (\omega U + (\omega - 1)D)\,x_k\,\bigr] \tag{2.4.4}$$

with the iteration matrix $G_{SOR} = -(D + \omega L)^{-1}(\omega U + (\omega - 1)D)$. The number of 
floating point operations per iteration for the SOR method is $\le 2\,\mathrm{nnz}(A) + 2n$ where 
$A$ is $n \times n$ and $\mathrm{nnz}(A)$ is the number of non-zero entries of $A$. 

Algorithm 24 Successive over-relaxation iteration for Ax = b 

    function sor(A, x, b, ω) 
        for i = 1, 2, ..., n 
            s ← b_i 
            for j = 1, 2, ..., n 
                if j ≠ i 
                    s ← s − a_ij x_j 
                end if 
            end for 
            x_i ← (1 − ω) x_i + ω s / a_ii 
        end for 
        return x 
    end function 
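
A Python sketch of one SOR sweep follows, again assuming (for illustration only) that each row is stored as a dictionary `{j: a_ij}` of its non-zero entries. The function name `sor_sweep` is hypothetical.

```python
def sor_sweep(rows, x, b, omega):
    """One SOR iteration in place; rows[i] is a dict {j: a_ij} of non-zeros."""
    for i, row in enumerate(rows):
        s = b[i] - sum(a * x[j] for j, a in row.items() if j != i)
        x[i] = (1.0 - omega) * x[i] + omega * s / row[i]
    return x

# Symmetric positive definite tridiagonal example with exact solution (1, 1, 1).
rows = [{0: 4.0, 1: -1.0}, {0: -1.0, 1: 4.0, 2: -1.0}, {1: -1.0, 2: 4.0}]
b = [3.0, 2.0, 3.0]
x = [0.0, 0.0, 0.0]
for _ in range(60):
    x = sor_sweep(rows, x, b, omega=1.1)
print(x)
```

With `omega=1.0` the update reduces to the Gauss-Seidel update, as noted above.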

There are block versions of these algorithms where instead of $a_{ij}$ being a scalar, 
we treat $A_{ij}$ as a block in the matrix $A$, while $x = [x_1^T, x_2^T, \ldots, x_m^T]^T$ and 
$b = [b_1^T, b_2^T, \ldots, b_m^T]^T$ have blocks consistent with the blocks $A_{ij}$. The sums $s$ become 
block vectors, and division by $a_{ii}$ is replaced by pre-multiplication by $A_{ii}^{-1}$. Many 
iterative algorithms for solving large linear systems can be understood as block 
Jacobi, block Gauss-Seidel, or block SOR iterations. 


2.4.1.1 Convergence Analysis 


The convergence of the iteration (2.4.1) depends on the matrix $G := M^{-1}N$. For an 
iteration $z_{k+1} \leftarrow G z_k + y$ we have 

$$\begin{aligned}
z_1 &= G z_0 + y, \\
z_2 &= G(G z_0 + y) + y = G^2 z_0 + (G + I)y, \\
z_3 &= G(G^2 z_0 + (G + I)y) + y = G^3 z_0 + (G^2 + G + I)y, \\
&\ \ \vdots \\
z_k &= G^k z_0 + (G^{k-1} + \cdots + G + I)y.
\end{aligned}$$

This converges if $\|G\| < 1$ for some matrix norm $\|\cdot\|$, so $\|G^k\| \le \|G\|^k \to 0$, and 
$G^{k-1} + \cdots + G + I \to (I - G)^{-1}$ as $k \to \infty$ (see Lemma 2.2). There is a more 
precise convergence theorem. 


Theorem 2.15 The matrix $G$ has the property that $G^k \to 0$ as $k \to \infty$ if and only 
if $\rho(G) < 1$ where $\rho(G)$ is the spectral radius 

$$\rho(G) = \max\{\, |\lambda| : \lambda \text{ is an eigenvalue of } G \,\}. \tag{2.4.5}$$

Furthermore, if $\rho(G) < 1$ then $I + G + G^2 + \cdots$ converges and is equal to 
$(I - G)^{-1}$. 


This result is a consequence of the following theorem: 


Theorem 2.16 For any square matrix $A$ and $\epsilon > 0$, there is an induced matrix norm 
$\|\cdot\|_{(\epsilon)}$ where 
$$\|A\|_{(\epsilon)} \le \rho(A) + \epsilon.$$

Note that for any induced matrix norm and square matrix $A$, $\rho(A) \le \|A\|$. 

Theorem 2.16 implies that if $\rho(G) < 1$ then for $\epsilon = (1 - \rho(G))/2$ there is an 
induced matrix norm $\|\cdot\|_{(\epsilon)}$ where $\|G\|_{(\epsilon)} \le \rho(G) + \epsilon = \tfrac{1}{2}(1 + \rho(G)) < 1$, and so 
$G^k \to 0$ as $k \to \infty$ and $I + G + G^2 + \cdots = (I - G)^{-1}$ by Lemma 2.2. The proof 
of Theorem 2.16 we develop here uses the Schur decomposition of Section 2.5.2. 


Proof Note that if $A \mapsto \|A\|$ is a matrix norm induced by a suitable vector norm 
$v \mapsto \|v\|$ then for invertible $X$, $A \mapsto \|X A X^{-1}\|$ is the matrix norm induced by the 
norm $v \mapsto \|Xv\|$. To see this, let $\|v\|_X = \|Xv\|$. Then 

$$\begin{aligned}
\|A\|_X &= \max_{z \neq 0} \frac{\|Az\|_X}{\|z\|_X} = \max_{z \neq 0} \frac{\|XAz\|}{\|Xz\|} \\
&= \max_{y \neq 0} \frac{\|XAX^{-1}y\|}{\|y\|} \quad\text{where } Xz = y \\
&= \|XAX^{-1}\|.
\end{aligned}$$

Now, by the Schur decomposition (Theorem 2.23), there is a unitary matrix $U$ where 
$\overline{U}^T A U = T$ with $T$ upper triangular. Note that $\overline{U}^T = U^{-1}$. Write $T = D + N$ 
where $D$ is the diagonal part of $T$ so each $d_{jj}$ is an eigenvalue of $A$. The matrix 
$N$ is strictly upper triangular ($n_{ij} = 0$ for all $i \ge j$). Let $S(\alpha)$ be the diagonal 
matrix with diagonal entries $s_{jj}(\alpha) = \alpha^j$. Then $(S(\alpha)^{-1} N S(\alpha))_{ij} = \alpha^{j-i} n_{ij}$ which 
is zero if $i \ge j$ and goes to zero as $\alpha \downarrow 0$ if $i < j$. That is, $S(\alpha)^{-1} N S(\alpha) \to 0$ 
as $\alpha \downarrow 0$. On the other hand, since $D$ is diagonal and diagonal matrices commute, 
$S(\alpha)^{-1} D S(\alpha) = D$. Choose $\alpha > 0$ sufficiently small so that $\|S(\alpha)^{-1} N S(\alpha)\|_2 < \epsilon$. 

Let $X = S(\alpha)^{-1} U^{-1}$; then 

$$\begin{aligned}
X A X^{-1} &= S(\alpha)^{-1} U^{-1} A U S(\alpha) = S(\alpha)^{-1} T S(\alpha) \\
&= D + S(\alpha)^{-1} N S(\alpha), \quad\text{and} \\
\|X A X^{-1}\|_2 &\le \|D\|_2 + \|S(\alpha)^{-1} N S(\alpha)\|_2 < \|D\|_2 + \epsilon.
\end{aligned}$$

Recall that since $D$ is diagonal, $\|D\|_2 = \sqrt{\lambda_{\max}(\overline{D}^T D)} = \max_k |d_{kk}| = \rho(D)$, its 
spectral radius. But since the eigenvalues of $A$ are the eigenvalues of $D$, $\|D\|_2 = 
\rho(A)$. Thus in the induced matrix norm for the vector norm $v \mapsto \|Xv\|_2$, 

$$\|A\|_{X,2} = \|X A X^{-1}\|_2 < \rho(A) + \epsilon,$$

as we wanted. 


Note that in any induced matrix norm $\rho(A) \le \|A\|$: if $Av = \lambda v$ and $v \neq 0$, then 
$|\lambda|\,\|v\| = \|\lambda v\| = \|Av\| \le \|A\|\,\|v\|$ so $|\lambda| \le \|A\|$. Taking the maximum over all 
eigenvalues $\lambda$ gives $\rho(A) \le \|A\|$. 

So the iteration $z_{k+1} \leftarrow G z_k + y$ is convergent if and only if $\rho(G) < 1$. If $\rho(G) \ge 1$ 
there must be an eigenvalue $\lambda$ with $|\lambda| \ge 1$ and an eigenvector $v \neq 0$ where $Gv = \lambda v$; 
then if $z^* = G z^* + y$ we have $z_{k+1} - z^* = G(z_k - z^*)$. If $z_0 = z^* + v$ then 
$z_k = z^* + G^k v = z^* + \lambda^k v$, which does not approach $z^*$. The iteration does not converge. 

Returning to the iteration (2.4.1), we see it is convergent if and only if $\rho(M^{-1}N) < 1$. 
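
A small numerical experiment illustrates that it is the spectral radius, not any particular norm, that decides convergence. The matrix below is an example chosen here for illustration: it has $\|G\|_\infty = 10.5 > 1$ but $\rho(G) = 0.5 < 1$, and the fixed point of $z = Gz + y$ is $z^* = (42, 2)$.

```python
def iterate(G, y, z, steps):
    """Apply z <- G z + y repeatedly; G is a dense matrix stored as a list of rows."""
    for _ in range(steps):
        z = [sum(g * zj for g, zj in zip(row, z)) + yi for row, yi in zip(G, y)]
    return z

G = [[0.5, 10.0],
     [0.0, 0.5]]       # upper triangular, so both eigenvalues equal 0.5
y = [1.0, 1.0]

z = iterate(G, y, [0.0, 0.0], 200)
print(z)
```

The early iterates can move away from the fixed point because the off-diagonal entry is large, but the powers $G^k$ still tend to zero and the iteration converges.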

Convergence can often be easily shown for certain classes of matrices. Consider 
first strictly row dominant matrices: $A$ is strictly row dominant if 

$$|a_{ii}| > \sum_{j:\, j \neq i} |a_{ij}| \quad\text{for all } i.$$

Non-strictly dominant matrices are defined in (2.1.6). We can immediately see that 
the Jacobi iteration is convergent as 

$$\rho(G_J) \le \|G_J\|_\infty = \|-D^{-1}(L + U)\|_\infty = \max_i \frac{1}{|a_{ii}|} \sum_{j:\, j \neq i} |a_{ij}| < 1.$$


Gauss-Seidel is also convergent in this case as $\rho(G_{GS}) = \rho(-(D + L)^{-1}U) \ge 1$ 
implies that there is an eigenvalue $\lambda$ with $|\lambda| \ge 1$. But if $-(D + L)^{-1}Uv = \lambda v$ with 
$v \neq 0$, this would mean that $\lambda^{-1}Uv = -(D + L)v$ and $Dv = -(L + \lambda^{-1}U)v$. Then 

$$-\lambda^{-1} \sum_{j:\, j > i} a_{ij} v_j - \sum_{j:\, j < i} a_{ij} v_j = a_{ii} v_i.$$

In particular, this is true for the index $i$ where $|v_i| = \max_\ell |v_\ell| = \|v\|_\infty$. Then taking 
absolute values of both sides, 

$$|\lambda|^{-1} \sum_{j:\, j > i} |a_{ij}|\,|v_j| + \sum_{j:\, j < i} |a_{ij}|\,|v_j| \ge |a_{ii}|\,|v_i| = |a_{ii}|\,\|v\|_\infty.$$

Dividing both sides by $\|v\|_\infty$ and using $|v_j| / \|v\|_\infty \le 1$, we get 

$$|a_{ii}| \le |\lambda|^{-1} \sum_{j:\, j > i} |a_{ij}| + \sum_{j:\, j < i} |a_{ij}|,$$

which contradicts strict dominance as $|\lambda|^{-1} \le 1$. Thus there can be no eigenvalue of 
$G_{GS}$ of magnitude $\ge 1$ for Gauss-Seidel applied to a strictly row dominant matrix. 
While strict diagonal dominance is sufficient for convergence, it is far from nec- 
essary. For these methods, if A is symmetric positive definite then the methods 
converge. The best way to show this is via the Householder—-John theorem: 


Theorem 2.17 If $A$ and $M + \overline{M}^T - A$ are complex Hermitian positive definite, then 
$\rho(I - M^{-1}A) < 1$. 


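Proof Suppose that $(I - M^{-1}A)v = \lambda v$ with $v \neq 0$. Then $(1 - \lambda)v = M^{-1}Av$ so 
$(1 - \lambda)Mv = Av$. Since $A$ is positive definite, $A$ is invertible, so $\lambda \neq 1$. Thus 
$Mv = (1 - \lambda)^{-1}Av$. Pre-multiplying by $\overline{v}^T$ gives $\overline{v}^T M v = (1 - \lambda)^{-1}\,\overline{v}^T A v$. 
Taking conjugate transposes using $\overline{A}^T = A$ gives $\overline{v}^T \overline{M}^T v = (1 - \overline{\lambda})^{-1}\,\overline{v}^T A v$. Then 

$$\overline{v}^T \bigl[ M + \overline{M}^T - A \bigr] v 
= \overline{v}^T A v \left[ \frac{1}{1 - \lambda} + \frac{1}{1 - \overline{\lambda}} - 1 \right] 
= \overline{v}^T A v \; \frac{1 - |\lambda|^2}{|1 - \lambda|^2}.$$

Since the left-hand side is positive and $\overline{v}^T A v$ is positive, $1 - |\lambda|^2 > 0$. That is, $|\lambda| < 1$. 
Therefore $\rho(I - M^{-1}A) < 1$. 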


This theorem can be used to show convergence of the Jacobi and Gauss-Seidel 
methods. 


Theorem 2.18 If $A$ is Hermitian positive definite then the Gauss-Seidel iteration 
converges. If, in addition, the matrix $2D - A$ is positive definite where $D$ is the 
diagonal part of $A$, then the Jacobi iteration converges. 


Proof Let $D$ be the diagonal part of $A$, $L$ the strictly lower triangular part of $A$, and 
$U = \overline{L}^T$ the strictly upper triangular part of $A$. For the Gauss-Seidel method, $M = 
D + L$ and the iteration matrix is $G_{GS} = -(D + L)^{-1}U = I - (D + L)^{-1}A = I - 
M^{-1}A$. To show that the conditions of Theorem 2.17 hold, we first note that $A$ is 
Hermitian positive definite. We also need to show that $M + \overline{M}^T - A$ is positive 
definite. Now $M + \overline{M}^T = D + L + \overline{D}^T + \overline{L}^T = 2D + L + U$ so $M + \overline{M}^T - A = 
2D + L + U - (D + L + U) = D$ which is positive definite as each $d_{ii} > 0$ by 
positive definiteness of $A$. Therefore, we can apply Theorem 2.17 to conclude that 
$\rho(G_{GS}) < 1$. 

For the Jacobi method, $M = D = \overline{D}^T$ so $M + \overline{M}^T - A = 2D - A$ which is 
positive definite by assumption. As $A$ is also positive definite by assumption, we can 
apply Theorem 2.17 to conclude that $\rho(G_J) = \rho(I - D^{-1}A) < 1$ and so the Jacobi 
method is convergent. □ 


Example 2.19 To illustrate how these methods work in practice, consider the 
problem of solving a discrete approximation to the Poisson equation 

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x, y), \qquad (x, y) \in [0, 1] \times [0, 1] \tag{2.4.6}$$

with zero boundary conditions: $u(x, 0) = u(x, 1) = u(0, y) = u(1, y) = 0$ for all 
$x, y \in [0, 1]$. Using the discrete approximations 

$$\frac{\partial^2 u}{\partial x^2}(x_i, y_j) \approx \frac{u(x_{i+1}, y_j) - 2u(x_i, y_j) + u(x_{i-1}, y_j)}{(\Delta x)^2},$$

where $x_i = i\,\Delta x$ and $y_j = j\,\Delta x$ and $\Delta x = 1/(N + 1)$, we obtain the equations for 
the approximate values $u_{i,j} \approx u(x_i, y_j)$: 

$$\frac{u_{i+1,j} + u_{i,j+1} - 4u_{i,j} + u_{i-1,j} + u_{i,j-1}}{(\Delta x)^2} = f(x_i, y_j), \tag{2.4.7}$$

taking $u_{i,j} = 0$ for $i = 0$ or $i = N + 1$ or $j = 0$ or $j = N + 1$ for the zero boundary 
conditions. 


This example is used with the right-hand side $f(x, y) = 1$ for all $(x, y)$ to generate 
the results shown in Figure 2.4.1. 
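
A matrix-free sketch of this experiment is given below. It is not the code used to produce Figure 2.4.1; the grid size `N = 10`, the zero starting guess, the number of sweeps, and the function names are choices made here for illustration. The array `u` stores the boundary values as an extra layer of zeros, and each sweep solves (2.4.7) for $u_{i,j}$.

```python
def jacobi_sweep(u, f, h):
    """One Jacobi sweep for (2.4.7); u carries a layer of zero boundary values."""
    n = len(u) - 2
    new = [row[:] for row in u]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            new[i][j] = (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1]
                         - h * h * f) / 4.0
    return new

def gauss_seidel_sweep(u, f, h):
    """One Gauss-Seidel sweep in place, using updated values as soon as available."""
    n = len(u) - 2
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            u[i][j] = (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1]
                       - h * h * f) / 4.0
    return u

def residual_norm(u, f, h):
    """Max-norm of the residual of (2.4.7) over the interior grid points."""
    n = len(u) - 2
    return max(abs((u[i + 1][j] + u[i][j + 1] - 4.0 * u[i][j]
                    + u[i - 1][j] + u[i][j - 1]) / (h * h) - f)
               for i in range(1, n + 1) for j in range(1, n + 1))

N = 10
h = 1.0 / (N + 1)
f = 1.0                                   # right-hand side f(x, y) = 1
uj = [[0.0] * (N + 2) for _ in range(N + 2)]
ug = [[0.0] * (N + 2) for _ in range(N + 2)]
for _ in range(200):
    uj = jacobi_sweep(uj, f, h)
    ug = gauss_seidel_sweep(ug, f, h)
res_j = residual_norm(uj, f, h)
res_g = residual_norm(ug, f, h)
print(res_j, res_g)
```

After the same number of sweeps, the Gauss-Seidel residual is much smaller than the Jacobi residual for this problem.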

As can be clearly seen from Figure 2.4.1, for this example, the Gauss-Seidel 
iteration converges faster than the Jacobi iteration. The residual norms for the Jacobi 
iteration are used to estimate $\rho(G_J)$. For a class of matrices having “Property A” 
[265] in addition to being symmetric positive definite, Gauss-Seidel is twice as fast as 
the Jacobi method: $\rho(G_{GS}) = \rho(G_J)^2$. Property A is that the graph of $A - D$ is 
a bipartite graph, that is, the vertices of the graph of $A - D$ can be split into two 
disjoint subsets $V = V_1 \cup V_2$ where every edge joins a vertex in $V_1$ with a vertex in 
$V_2$. If the rows and columns are ordered consistently with this partition of vertices, 
then we can compute the optimal $\omega$ for the SOR iteration (2.4.4), which is given by 
[225, p. 112-116]: 

$$\omega_{\mathrm{opt}} = \frac{2}{1 + \sqrt{1 - \rho(G_J)^2}}. \tag{2.4.8}$$

This value of $\omega$ is optimal in the sense of minimizing $\rho(G_{SOR})$. Using this value from 
the estimate of $\rho(G_J)$ gives the results shown in Figure 2.4.1 for the SOR iteration. 

Fig. 2.4.1 Results for classical iterations 
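
The optimal parameter is simple to compute once an estimate of $\rho(G_J)$ is available. The sketch below assumes, as one plausible reading of the text, that $\rho(G_J)$ is estimated by the ratio of successive Jacobi residual norms; both function names are hypothetical.

```python
import math

def estimate_rho(residual_norms):
    """Estimate the spectral radius as the ratio of the last two residual norms."""
    return residual_norms[-1] / residual_norms[-2]

def optimal_omega(rho_jacobi):
    """Optimal SOR parameter omega = 2 / (1 + sqrt(1 - rho^2)) for 0 <= rho < 1."""
    return 2.0 / (1.0 + math.sqrt(1.0 - rho_jacobi ** 2))

# Synthetic residual norms that shrink by a factor of 0.6 per iteration.
norms = [1.0, 0.6, 0.36]
rho = estimate_rho(norms)
print(rho, optimal_omega(rho))
```

As $\rho(G_J)$ approaches one, the optimal $\omega$ approaches two; for $\rho(G_J) = 0$ it reduces to $\omega = 1$, which is the Gauss-Seidel iteration.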


2.4.2 Conjugate Gradients and Krylov Subspaces 


The conjugate gradient method was first published by Hestenes and Stiefel [122] in 
1952. It applies to symmetric positive-definite linear systems Ax = b. The derivation 
of the conjugate gradient algorithm is often framed as minimizing a quadratic convex 
function $f(x) = \tfrac{1}{2}x^T A x - b^T x + c$; this is equivalent to the linear system $Ax = b$ 
provided $A$ is symmetric positive definite. 

The conjugate gradient method was initially considered to be a candidate for a 
general-purpose direct solver for solving symmetric positive-definite linear systems 
as in exact arithmetic it should solve an $n \times n$ linear system in $n$ conjugate gradient 
steps. Numerical issues and a certain kind of numerical instability meant that it did 
not work well in this way. However, as an iterative method, it works very well, 
often taking far fewer than n steps to give sufficiently accurate solutions. While the 
performance of the conjugate gradient method is generally good, it can be improved 
by the use of preconditioners. 

Of particular importance for the success of the conjugate gradient method is that 
the method only requires the ability to compute matrix-vector products $Az$ for given 
vectors $z$. This means that functional representations of $A$ can be used: $A$ is represented 
by a function $x \mapsto Ax$. This is useful, not just for sparse matrices, but for many other 
matrices that can be represented efficiently, for example, by using a combination of 
low-rank matrices and discrete Fourier transforms as well as sparse matrices. 


2.4.2.1 Derivation and Properties of the Conjugate Gradient Method 


Minimizing a convex quadratic function 

$$f(x) = \tfrac{1}{2} x^T A x - b^T x + c$$

is equivalent to solving the linear system 

$$\nabla f(x) = Ax - b = 0.$$

For positive definite $A$, this can be done using a method called the conjugate gradient 
method that has close ties to optimization. The basic idea is to write 

$$x_k = x_0 + \alpha_0 p_0 + \alpha_1 p_1 + \cdots + \alpha_{k-1} p_{k-1},$$

where the $\alpha_j$’s are scalars and the $p_j$’s are search directions. Then $f(x_k)$ is a quadratic 
function of the $\alpha_j$’s, and finding the minimizing $\alpha_j$’s can be done by solving a linear 
system. But linear systems can be most easily solved when they are diagonal. The 
linear system for the minimizing $\alpha_j$’s is 

$$p_i^T \Bigl( A \bigl( x_0 + \sum_{j=0}^{k-1} \alpha_j p_j \bigr) - b \Bigr) = 0 \quad\text{for } i = 0, 1, 2, \ldots, k-1.$$

The linear system is $k \times k$ with $(i, j)$ entry given by $p_i^T A p_j$. This matrix is diagonal 
if 

$$p_i^T A p_j = 0 \quad\text{for all } i \neq j. \tag{2.4.9}$$

This is the condition that the $p_j$’s are $A$-conjugate, or just conjugate if $A$ is understood. 
If the $p_j$’s are conjugate (with respect to $A$) then the system of linear equations 
for the $\alpha_j$’s becomes simply 

$$p_i^T A p_i\, \alpha_i = p_i^T (b - A x_0) = p_i^T \Bigl( b - A \bigl( x_0 + \sum_{j=0}^{i-1} \alpha_j p_j \bigr) \Bigr).$$

Note that increasing $k$ does not change the value of $\alpha_i$ for $i < k$. This means that 
$x_{k+1} = x_k + \alpha_k p_k$. Note that $x_k$ minimizes $f(x_0 + \sum_{j=0}^{k-1} \alpha_j p_j)$ over all $\alpha_j$’s, that 
is, $x_k$ minimizes $f(z)$ over all $z \in x_0 + \mathrm{span}\{p_0, \ldots, p_{k-1}\}$. This is the conjugate 
gradient minimization property. 

Let $r_k = \nabla f(x_k) = A x_k - b$. In optimization, this is clearly the gradient; in linear 
algebra, it is called the residual for $x_k$. 

If we had a sequence $p_0, p_1, \ldots$ of conjugate vectors then we could design an
iterative algorithm for minimizing $f(x)$ as shown in Algorithm 25.

Algorithm 25 Minimization with given conjugate directions

    function conjdirns($A$, $b$, $x_0$, $(p_0, p_1, p_2, \ldots)$)
      for $k \leftarrow 0, 1, 2, \ldots, n$
        $r_k \leftarrow A x_k - b$
        $\alpha_k \leftarrow -p_k^T r_k / p_k^T A p_k$
        $x_{k+1} \leftarrow x_k + \alpha_k p_k$
      end for
      return $x_k$
    end function

The problem now is to find out how to generate the conjugate $p_k$'s.

At the beginning, any $p_0 \neq 0$ by itself is conjugate. So we have a place to start.
We can then proceed using mathematical induction. Now let's suppose we have
generated $p_0, p_1, \ldots, p_k$ which are, so far, all conjugate. We will also show that
the residuals $r_0, r_1, \ldots, r_k$ are all orthogonal, and that $\operatorname{span}\{r_0, r_1, \ldots, r_k\} =
\operatorname{span}\{p_0, p_1, \ldots, p_k\} = \operatorname{span}\{r_0, A r_0, A^2 r_0, \ldots, A^k r_0\}$. These will be our
induction hypotheses. Note that this last subspace

$$K_k(A, r_0) = \operatorname{span}\{r_0, A r_0, A^2 r_0, \ldots, A^k r_0\} \tag{2.4.10}$$

is the Krylov subspace generated by the original residual $r_0$. Note that $p_0 = -r_0$ so
all the induction hypotheses are true for $k = 0$. Now we want to find $p_{k+1}$ so that
$p_0, p_1, \ldots, p_k, p_{k+1}$ are all conjugate, and the other induction hypotheses are true
with $k$ replaced by $k + 1$.

Algorithm 26 Conjugate gradients algorithm, version 1

    function conjgrad1($A$, $b$, $x_0$, $\epsilon$)
      $k \leftarrow 0$
      while $\|A x_k - b\|_2 > \epsilon$
        $r_k \leftarrow A x_k - b$
        if $k = 0$: $p_k \leftarrow -r_k$ end if
        $\alpha_k \leftarrow -p_k^T r_k / p_k^T A p_k$
        $x_{k+1} \leftarrow x_k + \alpha_k p_k$
        $r_{k+1} \leftarrow A x_{k+1} - b$
        $\beta_k \leftarrow r_{k+1}^T A p_k / p_k^T A p_k$
        $p_{k+1} \leftarrow -r_{k+1} + \beta_k p_k$
        $k \leftarrow k + 1$
      end while
      return $x_k$
    end function

How can we generate $p_{k+1}$? We can compute $x_{k+1}$ from $p_0, p_1, \ldots, p_k$ so
we can compute $r_{k+1}$. We are going to suppose that we have a particular form
$p_{k+1} = -r_{k+1} + \beta_k p_k$ for $p_{k+1}$. The amazing thing is that this will work. But more
on that later. We can find $\beta_k$ by requiring conjugacy between $p_k$ and $p_{k+1}$, using
symmetry of $A$ so that $p_k^T A r_{k+1} = r_{k+1}^T A^T p_k = r_{k+1}^T A p_k$:

$$0 = p_k^T A p_{k+1} = p_k^T A (-r_{k+1} + \beta_k p_k) = -r_{k+1}^T A p_k + \beta_k \, p_k^T A p_k, \quad \text{so}$$

$$\beta_k = r_{k+1}^T A p_k / p_k^T A p_k. \tag{2.4.11}$$

This ensures conjugacy between two vectors: $p_k^T A p_{k+1} = 0$. But it is enough, as we
will see.

The algorithm with this way of generating the conjugate directions is the conjugate
gradient method which is shown in Algorithm 26.
There are several different ways of representing the conjugate gradient method
for linear symmetric positive-definite systems of equations; they are all equivalent
because of the many relationships between the $r_j$'s and $p_j$'s.


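Theorem 2.20 In Algorithm 26, for each $k = 0, 1, 2, \ldots$ and $i, j \le k$,

$$p_i^T A p_j = 0 \quad \text{for all } i \neq j, \tag{2.4.12}$$
$$r_i^T r_j = 0 \quad \text{for all } i \neq j, \tag{2.4.13}$$
$$\operatorname{span}\{r_0, r_1, \ldots, r_k\} = \operatorname{span}\{p_0, p_1, \ldots, p_k\} = \operatorname{span}\{r_0, A r_0, \ldots, A^k r_0\}. \tag{2.4.14}$$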


Proof We use proof by induction on $k$.

Base case: For $k = 0$, we have (2.4.12) and (2.4.13) holding trivially, while
(2.4.14) holds because $p_0 = -r_0$.

Induction step: Suppose (2.4.12, 2.4.13, 2.4.14) hold for $k = m$. We show that
these hold for $k = m + 1$.

First note that $r_{m+1} \in \operatorname{span}\{p_m, p_{m+1}\} \subseteq \operatorname{span}\{p_0, p_1, \ldots, p_{m+1}\}$. On the
other hand,

$$p_{m+1} = -r_{m+1} + \beta_m p_m \in -r_{m+1} + \beta_m \operatorname{span}\{r_0, r_1, \ldots, r_m\} \subseteq \operatorname{span}\{r_0, r_1, \ldots, r_{m+1}\}.$$

So $\operatorname{span}\{r_0, r_1, \ldots, r_{m+1}\} = \operatorname{span}\{p_0, p_1, \ldots, p_m, p_{m+1}\}$. Also $r_{m+1}^T p_j = 0$
for all $j \le m$ from our optimality result.
That is, $r_{m+1}$ is orthogonal to $\operatorname{span}\{p_0, p_1, \ldots, p_m\} = \operatorname{span}\{r_0, r_1, \ldots, r_m\}$;
this means that $r_{m+1}^T r_j = 0$ for $j \le m$. Note that $r_{m+1} = r_m + \alpha_m A p_m$ (since
$x_{m+1} = x_m + \alpha_m p_m$). Then

$$r_{m+1} \in \operatorname{span}\{r_0, A r_0, \ldots, A^m r_0\} + \alpha_m A \operatorname{span}\{r_0, A r_0, \ldots, A^m r_0\} \subseteq \operatorname{span}\{r_0, A r_0, \ldots, A^m r_0, A^{m+1} r_0\}.$$

On the other hand, $A^{m+1} r_0 = A(A^m r_0)$ and $A^m r_0 \in \operatorname{span}\{r_0, A r_0, \ldots, A^m r_0\} =
\operatorname{span}\{p_0, p_1, \ldots, p_m\}$, so we can write $A^m r_0 = \sum_{j=0}^m \gamma_j p_j$ and $A^{m+1} r_0 =
\sum_{j=0}^m \gamma_j A p_j = \sum_{j=0}^m \gamma_j \alpha_j^{-1} (r_{j+1} - r_j) \in \operatorname{span}\{r_0, r_1, \ldots, r_{m+1}\}$. (Note that
$\alpha_j \neq 0$ since $\alpha_j = 0$ implies that $r_{j+1} = r_j$; orthogonality of the residuals then
implies that $0 = r_{j+1}^T r_j = r_j^T r_j$ and $r_j = 0$, so the algorithm must have already
terminated.) Thus $\operatorname{span}\{r_0, r_1, \ldots, r_{m+1}\} = \operatorname{span}\{r_0, A r_0, \ldots, A^{m+1} r_0\}$.

Now we show $p_i^T A p_{m+1} = 0$ for all $i \le m$. For $i \le m$, $p_i \in
\operatorname{span}\{r_0, A r_0, \ldots, A^i r_0\}$, so $A p_i \in \operatorname{span}\{r_0, A r_0, \ldots, A^i r_0, A^{i+1} r_0\} =
\operatorname{span}\{r_0, r_1, \ldots, r_{i+1}\}$. Note that $p_i^T A p_{m+1} = p_{m+1}^T A p_i$ (since $A$ is symmetric).
Then $p_{m+1}^T A p_i = -r_{m+1}^T A p_i + \beta_m p_m^T A p_i$ by the formula for $p_{m+1}$. But since
$i < m$, $r_{m+1}^T A p_i = 0$ as $r_{m+1}^T r_j = 0$ for all $j \le m$, and $p_m^T A p_i = 0$ for $i < m$
by the induction hypothesis. Thus $p_{m+1}^T A p_i = 0$ provided $i + 1 \le m$, that is, for
$i < m$ we have $p_i^T A p_{m+1} = 0$. As we have already shown that $p_m^T A p_{m+1} = 0$, we see
that $p_0, p_1, \ldots, p_m, p_{m+1}$ are conjugate, as we wanted. Thus, all the induction
hypotheses have been shown to be true for $k$ replaced by $m + 1$.

Thus, by the principle of induction, the result is proven for $k = 0, 1, 2, \ldots$.

Algorithm 27 Conjugate gradients algorithm, version 2

    function conjgrad2($A$, $b$, $x_0$, $\epsilon$)
      $k \leftarrow 0$; $r_0 \leftarrow A x_0 - b$; $p_0 \leftarrow -r_0$
      while $\|r_k\|_2 > \epsilon$
        $q_k \leftarrow A p_k$
        $\alpha_k \leftarrow -p_k^T r_k / p_k^T q_k$
        $x_{k+1} \leftarrow x_k + \alpha_k p_k$
        $r_{k+1} \leftarrow r_k + \alpha_k q_k$
        $\beta_k \leftarrow r_{k+1}^T r_{k+1} / r_k^T r_k$
        $p_{k+1} \leftarrow -r_{k+1} + \beta_k p_k$
        $k \leftarrow k + 1$
      end while
      return $x_k$
    end function


2.4.2.2 Reformulating the Algorithm 
The formula $\beta_k = r_{k+1}^T A p_k / p_k^T A p_k$ can be modified using the properties that
have been discovered, such as the orthogonality of the $r_j$'s, and $r_{k+1} = A(x_k +
\alpha_k p_k) - b = r_k + \alpha_k A p_k$. So $A p_k = (r_{k+1} - r_k)/\alpha_k$. Therefore,

$$r_{k+1}^T A p_k = r_{k+1}^T (r_{k+1} - r_k)/\alpha_k \quad \text{and}$$

$$p_k^T A p_k = p_k^T (r_{k+1} - r_k)/\alpha_k = -p_k^T r_k / \alpha_k$$

since $r_{k+1}$ is orthogonal to $p_k$. But $p_k = -r_k + \beta_{k-1} p_{k-1}$ so $p_k^T r_k = (-r_k +
\beta_{k-1} p_{k-1})^T r_k = -r_k^T r_k$ as $r_k^T p_{k-1} = 0$. By orthogonality of the $r_j$'s, $r_{k+1}^T r_k = 0$.
So $r_{k+1}^T A p_k / p_k^T A p_k = r_{k+1}^T r_{k+1} / r_k^T r_k$.

Using the update $r_{k+1} \leftarrow r_k + \alpha_k A p_k$ instead of using $r_{k+1} \leftarrow A x_{k+1} - b$, this
new formula for $\beta_k$, and $q_k = A p_k$ gives Algorithm 27. This makes clear that
there is only one matrix-vector multiplication per loop. The matrix $A$ can be
represented by a function, and this function is only called once per loop. Apart from
the representation of the matrix $A$, the only vectors needed to be stored are $x_k$, $r_k$,
$p_k$, and $q_k$. That is, to solve an $n \times n$ linear system, apart from the memory needed
for a functional representation of $A$, the amount of memory needed for conjugate
gradients is $\approx 4n$ floating point numbers. This means that extremely large problems
can be solved using conjugate gradients.
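As an illustration of the matrix-free formulation, the following is a minimal Python sketch of Algorithm 27 in which $A$ is supplied only as a function $z \mapsto Az$. The names `dot`, `laplace1d`, and the `max_iter` safeguard are assumptions made for this example and are not part of the algorithm as stated; the test operator is the tridiagonal matrix with stencil $(-1, 2, -1)$.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def conjgrad2(matvec, b, x0, eps, max_iter=1000):
    """Sketch of Algorithm 27; A is supplied only as a function z -> A z."""
    x = list(x0)
    r = [ai - bi for ai, bi in zip(matvec(x), b)]   # r_0 = A x_0 - b
    p = [-ri for ri in r]                            # p_0 = -r_0
    k = 0
    while math.sqrt(dot(r, r)) > eps and k < max_iter:
        q = matvec(p)                                # the only product with A per loop
        rr = dot(r, r)
        alpha = -dot(p, r) / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri + alpha * qi for ri, qi in zip(r, q)]
        beta = dot(r, r) / rr
        p = [-ri + beta * pi for ri, pi in zip(r, p)]
        k += 1
    return x, k

def laplace1d(z):
    """Hypothetical test operator: tridiagonal (-1, 2, -1) matrix applied to z."""
    n = len(z)
    return [2.0 * z[i]
            - (z[i - 1] if i > 0 else 0.0)
            - (z[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]

n = 50
b = [1.0] * n
x, iters = conjgrad2(laplace1d, b, [0.0] * n, 1e-10)
true_res = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(laplace1d(x), b)))
print("iterations:", iters, "true residual norm:", true_res)
```

Only four working vectors (`x`, `r`, `p`, `q`) are kept, which mirrors the $\approx 4n$ storage count; the matrix itself is never formed.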
2.4.2.3 Rate of Convergence


One of the important properties of the conjugate gradient algorithm is that
$x_{k+1}$ minimizes $f(x) := \tfrac12 x^T A x - b^T x$ over all $x \in x_0 + \operatorname{span}\{p_0, p_1, \ldots, p_k\}$.
The other fact is that $\operatorname{span}\{p_0, p_1, \ldots, p_k\} = \operatorname{span}\{r_0, A r_0, \ldots, A^k r_0\}$.
Then $x_{k+1} - x_0$ is a linear combination of vectors $A^j r_0$. Such a space
$\operatorname{span}\{r_0, A r_0, \ldots, A^k r_0\}$ is called a Krylov subspace. This means that we can
represent $x_{k+1} - x_0$ in terms of polynomials:

$$x_{k+1} - x_0 = \sum_{j=0}^{k} \gamma_j A^j r_0 = \Bigl( \sum_{j=0}^{k} \gamma_j A^j \Bigr) r_0 = g(A)\, r_0,$$

where $g$ is a polynomial of degree $\le k$. If $x^*$ is the minimizer of $f$ (that is, $A x^* = b$),
then $f(x) - f(x^*) = \tfrac12 \|x - x^*\|_A^2$ where $\|u\|_A = \sqrt{u^T A u}$ is the norm generated
by $A$.

We can study this better by using eigenvalues and eigenvectors of $A$:

$$A v_j = \lambda_j v_j, \qquad \|v_j\|_2 = 1$$

with $v_i^T v_j = 0$ if $i \neq j$. Note that if $g$ is a polynomial, then $g(A) v_j = g(\lambda_j) v_j$.
Let $x_0 - x^* = \sum_{j=1}^n c_j v_j$. Then $r_0 = A x_0 - b = A(x_0 - x^*) = \sum_{j=1}^n \lambda_j c_j v_j$.
Also, $x_{k+1} - x_0 = g(A) r_0 = g(A) A (x_0 - x^*) = \sum_{j=1}^n g(\lambda_j) \lambda_j c_j v_j$ so that
$x_{k+1} - x^* = \sum_{j=1}^n (1 + \lambda_j g(\lambda_j)) c_j v_j$, and

$$\|x_{k+1} - x^*\|_A^2 = \sum_{j=1}^n \lambda_j \bigl[ 1 + \lambda_j g(\lambda_j) \bigr]^2 c_j^2
\le \Bigl( \max_j |1 + \lambda_j g(\lambda_j)| \Bigr)^2 \sum_{j=1}^n \lambda_j c_j^2
= \Bigl( \max_j |1 + \lambda_j g(\lambda_j)| \Bigr)^2 \|x_0 - x^*\|_A^2.$$

The polynomial $g$ of degree $\le k$ is chosen to minimize $\|x_{k+1} - x^*\|_A$. So we look
for polynomials $g$ that make $\max_{\lambda \in \sigma(A)} |1 + \lambda g(\lambda)|$ small where $\lambda$ ranges over the
set of all eigenvalues $\sigma(A)$ of $A$. If we write $q(\lambda) = 1 + \lambda g(\lambda)$ we note that $q$ is a
polynomial of degree $\le k + 1$ with $q(0) = 1$ (that is, the constant term in $q$ is 1). On
the other hand, for any such $q$, $g(\lambda) = (q(\lambda) - 1)/\lambda$ is a polynomial of degree $\le k$.
If $A$ has only a few distinct eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_r$, we can set

$$q(\lambda) = \frac{(\lambda - \lambda_1)(\lambda - \lambda_2) \cdots (\lambda - \lambda_r)}{(-\lambda_1)(-\lambda_2) \cdots (-\lambda_r)}$$

and we can see that $q(\lambda_j) = 0$ for all eigenvalues of $A$, and $x_{k+1} = x^*$. This would
mean exact convergence within $r + 1$ steps. In practice, it rarely happens that a matrix has
only a few exactly repeated eigenvalues. However, it may happen that the eigenvalues
fall into a few small clusters. Then the choice of $q(\lambda)$ given above would make
$\max_{\lambda \in \sigma(A)} |1 + \lambda g(\lambda)|$ small.
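The effect of having only a few distinct eigenvalues can be checked numerically. The sketch below, under the assumption of a diagonal test matrix with the three distinct eigenvalues 1, 4, and 9 (each repeated four times), runs a fixed number of conjugate gradient steps and records the residual norms. The function name `cg_steps` and the test data are hypothetical choices for illustration.

```python
import math

def cg_steps(diag, b, steps):
    """Run a fixed number of CG steps (as in Algorithm 27) for A = diag(diag), x_0 = 0."""
    x = [0.0] * len(b)
    r = [-bi for bi in b]                 # r_0 = A x_0 - b = -b
    p = list(b)                           # p_0 = -r_0
    norms = [math.sqrt(sum(ri * ri for ri in r))]
    for _ in range(steps):
        q = [d * pi for d, pi in zip(diag, p)]            # q = A p
        rr = sum(ri * ri for ri in r)
        alpha = -sum(pi * ri for pi, ri in zip(p, r)) / sum(pi * qi for pi, qi in zip(p, q))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri + alpha * qi for ri, qi in zip(r, q)]
        beta = sum(ri * ri for ri in r) / rr
        p = [-ri + beta * pi for ri, pi in zip(r, p)]
        norms.append(math.sqrt(sum(ri * ri for ri in r)))
    return x, norms

diag = [1.0] * 4 + [4.0] * 4 + [9.0] * 4      # three distinct eigenvalues
b = [float(i + 1) for i in range(12)]
x, norms = cg_steps(diag, b, 3)
print(norms)                                   # residual norms after 0, 1, 2, 3 steps
```

With three distinct eigenvalues the residual should drop to roundoff level at the third step, while the second step still leaves a residual well above roundoff.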

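If we know nothing of the eigenvalues of $A$ except for the extreme eigenvalues
$0 < \lambda_1$ (smallest) and $\lambda_n$ (largest), then we can try to minimize $\max_{\lambda \in [\lambda_1, \lambda_n]} |q(\lambda)|$
with the constraint that $q(0) = 1$. This leads to Chebyshev polynomials (4.6.3):

$$T_j(\cos\theta) = \cos(j\theta), \qquad j = 0, 1, 2, 3, \ldots.$$

It is not immediately obvious that these are, in fact, polynomials. However, it is easy
to check that $T_0(x) = 1$, $T_1(x) = x$, and using the trigonometric addition formulas
we can show that

$$T_{k+1}(x) = 2x\, T_k(x) - T_{k-1}(x), \qquad k = 1, 2, 3, \ldots.$$

The nice property of these polynomials is that $|T_j(x)| \le 1$ whenever $|x| \le 1$. The
optimal $q$ is given by

$$q(\lambda) = \frac{T_{k+1}\bigl(1 - 2(\lambda - \lambda_1)/(\lambda_n - \lambda_1)\bigr)}{T_{k+1}\bigl((\lambda_n + \lambda_1)/(\lambda_n - \lambda_1)\bigr)}.$$

Then for $\lambda_1 \le \lambda \le \lambda_n$, $|T_{k+1}(1 - 2(\lambda - \lambda_1)/(\lambda_n - \lambda_1))| \le 1$. Thus
$\max_{\lambda \in [\lambda_1, \lambda_n]} |q(\lambda)| = 1 / T_{k+1}((\lambda_n + \lambda_1)/(\lambda_n - \lambda_1))$.
Note that the formula $T_j(\cos\theta) = \cos(j\theta)$ will not work
for $T_j(x)$ with $|x| > 1$. However, there is another formula that works here:
$T_j(\cosh u) = \cosh(ju)$. For $\lambda_n/\lambda_1 \gg 1$, $(\lambda_n + \lambda_1)/(\lambda_n - \lambda_1) = 1 + 2\lambda_1/(\lambda_n -
\lambda_1) \approx 1 + 2\lambda_1/\lambda_n$. Setting $(\lambda_n + \lambda_1)/(\lambda_n - \lambda_1) = \cosh u$ gives $u \approx 2\sqrt{\lambda_1/\lambda_n}$, and
$T_{k+1}((\lambda_n + \lambda_1)/(\lambda_n - \lambda_1)) \approx \cosh(2(k+1)\sqrt{\lambda_1/\lambda_n}) \approx \tfrac12 \exp(2(k+1)\sqrt{\lambda_1/\lambda_n})$.
This means that the number of iterations needed to achieve a certain error
tolerance $\epsilon$ is $O(\log(1/\epsilon)\, \kappa_2(A)^{1/2})$ since for a symmetric positive-definite matrix
$\kappa_2(A) = \lambda_n(A)/\lambda_1(A)$.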
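The Chebyshev bound is easy to evaluate numerically. The following sketch computes $T_j(x)$ by the three-term recurrence, compares it with the $\cosh$ formula for an argument larger than one, and tabulates the bound $1/T_{k+1}((\lambda_n + \lambda_1)/(\lambda_n - \lambda_1))$. The extreme eigenvalues $\lambda_1 = 1$ and $\lambda_n = 100$ and the function names are assumptions chosen only for illustration.

```python
import math

def chebyshev(j, x):
    """T_j(x) via the recurrence T_{k+1}(x) = 2 x T_k(x) - T_{k-1}(x)."""
    t_prev, t = 1.0, x
    if j == 0:
        return t_prev
    for _ in range(j - 1):
        t_prev, t = t, 2.0 * x * t - t_prev
    return t

def cg_bound(k, lam_min, lam_max):
    """Bound on max |q| over [lam_min, lam_max] for the optimal q of degree k + 1."""
    s = (lam_max + lam_min) / (lam_max - lam_min)
    return 1.0 / chebyshev(k + 1, s)

lam_min, lam_max = 1.0, 100.0                  # assumed extreme eigenvalues
s = (lam_max + lam_min) / (lam_max - lam_min)
u = math.acosh(s)                              # exact u with cosh(u) = s
u_approx = 2.0 * math.sqrt(lam_min / lam_max)  # the approximation used in the text
print("u =", u, " approximation:", u_approx)
for k in (4, 9, 19, 39):
    print(k + 1, cg_bound(k, lam_min, lam_max), 1.0 / math.cosh((k + 1) * u))
```

The two columns printed in the loop should agree to many digits, since both evaluate $1/T_{k+1}$ at the same point.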


2.4.2.4 Preconditioned Conjugate Gradient Algorithm 

Because the rate of convergence depends on the condition number $\kappa_2(A)$, it is
desirable to reformulate the system of equations so as to reduce this condition number.
We could try to reformulate the system of equations $Ax = b$ as

$$M^{-1} A x = M^{-1} b. \tag{2.4.15}$$

But, unless we are extremely lucky, we do not expect $M^{-1}A$ to be symmetric even
if its condition number is much better. However, it is self-adjoint with respect to
the inner product $(u, v)_M = u^T M v$. This defines an inner product provided $M$ is
symmetric and positive definite. To see that it is self-adjoint note that

$$(u, M^{-1}Av)_M = u^T M M^{-1} A v = u^T A v, \quad \text{and}$$
$$(M^{-1}Au, v)_M = (M^{-1}Au)^T M v = u^T A^T M^{-T} M v = u^T A v$$

by symmetry of $A$ and $M$.
We aim to choose $M$ where

• $M$ is symmetric and positive definite,
• $Mz = y$ can be solved for $z$ easily,
• the function $z \mapsto M^{-1}z$ can be efficiently implemented,
• $\kappa_2(M^{-1}A)$ is much less than $\kappa_2(A)$.

We transform the code (setting $z_j = M^{-1} r_j = M^{-1}(A x_j - b)$) from Algorithm 27,
replacing $A$ with $M^{-1}A$ and $u^T v$ with $(u, v)_M = u^T M v$. Note that this means that
$p^T A q$ is replaced by $p^T M M^{-1} A q = p^T A q$. This gives Algorithm 28.

Algorithm 28 Preconditioned conjugate gradients

    function pcconjgrad($A$, $M^{-1}$, $b$, $x_0$, $\epsilon$)
      $k \leftarrow 0$; $r_0 \leftarrow A x_0 - b$; $z_0 \leftarrow M^{-1} r_0$; $p_0 \leftarrow -z_0$
      while $\|r_k\|_2 > \epsilon$
        $q_k \leftarrow A p_k$
        $\alpha_k \leftarrow z_k^T r_k / p_k^T q_k$
        $x_{k+1} \leftarrow x_k + \alpha_k p_k$
        $r_{k+1} \leftarrow r_k + \alpha_k q_k$
        $z_{k+1} \leftarrow M^{-1} r_{k+1}$
        $\beta_k \leftarrow z_{k+1}^T r_{k+1} / z_k^T r_k$
        $p_{k+1} \leftarrow -z_{k+1} + \beta_k p_k$
        $k \leftarrow k + 1$
      end while
      return $x_k$
    end function

Here $\alpha_k = z_k^T r_k / p_k^T q_k$ is the same as $-p_k^T r_k / p_k^T q_k$, since $p_k^T r_k = -z_k^T r_k$.

In Algorithm 28, we can use functional representation for both $A$ and $M^{-1}$ with
one function evaluation for each of $A$ and $M^{-1}$ per iteration. Note that Algorithm 28
with $M = \alpha I$ for any $\alpha > 0$ is equivalent to Algorithm 27.
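A minimal Python sketch of preconditioned conjugate gradients with functional representations of both $A$ and $M^{-1}$ follows. The test matrix $A = STS$ (with $T$ tridiagonal with stencil $(-1, 4, -1)$ and $S$ a diagonal scaling ranging from 1 to 100), the diagonal (Jacobi) preconditioner $M = \operatorname{diag}(A)$, and all function names are assumptions made for this example; the diagonal preconditioner is used here only because it is simple, not because the text recommends it.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcconjgrad(matvec, minv, b, x0, eps, max_iter=5000):
    """Sketch of preconditioned CG; matvec is z -> A z, minv is r -> M^{-1} r."""
    x = list(x0)
    r = [a - bi for a, bi in zip(matvec(x), b)]    # r_0 = A x_0 - b
    z = minv(r)                                     # z_0 = M^{-1} r_0
    p = [-zi for zi in z]                           # p_0 = -z_0
    k = 0
    while math.sqrt(dot(r, r)) > eps and k < max_iter:
        q = matvec(p)
        zr = dot(z, r)
        alpha = zr / dot(p, q)                      # positive step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri + alpha * qi for ri, qi in zip(r, q)]
        z = minv(r)
        beta = dot(z, r) / zr
        p = [-zi + beta * pi for zi, pi in zip(z, p)]
        k += 1
    return x, k

n = 200
s = [100.0 ** (i / (n - 1)) for i in range(n)]      # hypothetical scaling, 1 to 100

def matvec(v):
    """A = S T S with T = tridiag(-1, 4, -1) and S = diag(s)."""
    out = []
    for i in range(n):
        t = 4.0 * s[i] * v[i]
        if i > 0:
            t -= s[i - 1] * v[i - 1]
        if i < n - 1:
            t -= s[i + 1] * v[i + 1]
        out.append(s[i] * t)
    return out

def jacobi(r):
    """M^{-1} r for M = diag(A); the diagonal of A is 4 s_i^2."""
    return [ri / (4.0 * si * si) for ri, si in zip(r, s)]

def identity(r):
    return list(r)

b = [1.0] * n
eps = 1e-8 * math.sqrt(n)
x_cg, it_cg = pcconjgrad(matvec, identity, b, [0.0] * n, eps)
x_pcg, it_pcg = pcconjgrad(matvec, jacobi, b, [0.0] * n, eps)
res_cg = math.sqrt(sum((a - bi) ** 2 for a, bi in zip(matvec(x_cg), b)))
res_pcg = math.sqrt(sum((a - bi) ** 2 for a, bi in zip(matvec(x_pcg), b)))
print("CG iterations:", it_cg, " PCG iterations:", it_pcg)
```

For this particular test matrix the diagonal scaling removes the ill-conditioning caused by $S$, so the preconditioned iteration should need noticeably fewer steps; passing `identity` as the preconditioner reduces the routine to the unpreconditioned method.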

Much work has gone into determining how to create a good preconditioner,
especially for specific families of linear systems. Readily implemented
preconditioners for conjugate gradients include the symmetrized Gauss-Seidel preconditioner
$M_{SGS} = (D - U)^{-1} D (D - L)^{-1}$ where $D$ is the diagonal of $A$, $L$ is the strictly lower
triangular part of $A$, and $U$ is the strictly upper triangular part of $A$. A similar
symmetrized SOR preconditioner is given by $M_{SSOR} = (D - \omega U)^{-1} D (D - \omega L)^{-1}$.
The exact choice of the parameter $\omega$ is less crucial for its use as a preconditioner than
for using the SOR method directly. Another possibility is to use incomplete sparse
matrix factorizations. For other problems, like large systems that come from partial
differential equations, specialized methods such as multigrid methods are very good.

If we compare the classical methods shown in Figure 2.4.1 with the results for
conjugate gradients, with and without preconditioning, we can see the results in
Figure 2.4.2. The preconditioner used is the symmetrized Gauss-Seidel (SGS)
preconditioner mentioned above.

Fig. 2.4.2 Comparison of classical iterative methods with conjugate gradients with and without
preconditioning.


2.4.3 Non-symmetric Krylov Subspace Methods 


Krylov subspaces are natural vector spaces to consider for iterative methods. We
suppose that we start with a vector $x_0 \neq 0$, and the operations we are allowed to
perform are linear combinations, inner products, and matrix-vector products $x \mapsto
Ax$. Of these, we will assume that computing matrix-vector products is the most
computationally intensive operation. The question naturally arises: What is the set
of vectors we can obtain with linear combinations and no more than $r$ matrix-vector
products? The answer is $K_r(A, x_0) = \operatorname{span}\{x_0, A x_0, \ldots, A^r x_0\}$, which is
the Krylov subspace generated by $A$ and $x_0$, of dimension at most $r + 1$. We will see how
we can use this idea to develop methods for solving equations and also finding
eigenvalues and eigenvectors (see Section 2.5.5).
2.4.3.1 The Lanczos and Arnoldi Iterations


The Arnoldi iteration [9] is a way of generating an orthonormal basis for Krylov
subspaces $K_r(A, x_0) = \operatorname{span}\{x_0, A x_0, \ldots, A^r x_0\}$. While it is normally understood
as part of a recipe for computing estimates of eigenvalues and eigenvectors, it is a
cornerstone of many algorithms for solving large systems of equations.

The basis of the idea of the Arnoldi iteration is fairly simple: given a previously
constructed orthonormal basis $\{v_1, v_2, \ldots, v_k\}$ for the Krylov subspace
$\operatorname{span}\{v_1, A v_1, \ldots, A^{k-1} v_1\}$, we use the Gram-Schmidt process (2.2.11) to compute a
new element $v_{k+1}$ of an orthonormal basis by
orthogonalizing $A v_k$ against $v_1, v_2, \ldots, v_k$ and normalizing:

$$w_{k+1} = A v_k - ((A v_k)^T v_1)\, v_1 - \cdots - ((A v_k)^T v_k)\, v_k,$$
$$v_{k+1} = w_{k+1} / \|w_{k+1}\|_2.$$

As noted in Section 2.2.2.2, the original Gram-Schmidt process has numerical
instabilities that can be avoided by using the modified Gram-Schmidt process. Setting
$h_{\ell k} = (A v_k)^T v_\ell$ for $\ell \le k$ and $h_{k+1,k} = \|w_{k+1}\|_2$, we see that

$$A v_k = \sum_{\ell=1}^{k+1} h_{\ell k}\, v_\ell.$$

We set $h_{\ell k} = 0$ for $\ell > k + 1$. Following the Gram-Schmidt reconstruction (2.2.12),
we can write

$$A [v_1, v_2, \ldots, v_k] = [v_1, v_2, \ldots, v_k, v_{k+1}]
\begin{bmatrix}
h_{11} & h_{12} & \cdots & h_{1k} \\
h_{21} & h_{22} & \cdots & h_{2k} \\
0 & h_{32} & \ddots & \vdots \\
\vdots & \ddots & \ddots & h_{kk} \\
0 & \cdots & 0 & h_{k+1,k}
\end{bmatrix}
= [v_1, v_2, \ldots, v_k]
\begin{bmatrix}
h_{11} & h_{12} & \cdots & h_{1k} \\
h_{21} & h_{22} & \cdots & h_{2k} \\
 & \ddots & \ddots & \vdots \\
0 & \cdots & h_{k,k-1} & h_{kk}
\end{bmatrix}
+ h_{k+1,k}\, v_{k+1} e_k^T.$$

If we put

$$V_k = [v_1, v_2, \ldots, v_k] \quad \text{and} \quad
H_k = \begin{bmatrix}
h_{11} & h_{12} & \cdots & h_{1k} \\
h_{21} & h_{22} & \cdots & h_{2k} \\
 & \ddots & \ddots & \vdots \\
0 & \cdots & h_{k,k-1} & h_{kk}
\end{bmatrix}, \quad \text{then}$$

$$A V_k = V_k H_k + h_{k+1,k}\, v_{k+1} e_k^T.$$
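Algorithm 29 Arnoldi iteration

     1  function arnoldi($A$, $v_1$, $m$)
     2    $v_1 \leftarrow v_1 / \|v_1\|_2$
     3    for $k = 1, 2, \ldots, m$
     4      $w_k \leftarrow A v_k$
     5      for $j = 1, 2, \ldots, k$
     6        $h_{jk} \leftarrow w_k^T v_j$
     7        $w_k \leftarrow w_k - h_{jk} v_j$
     8      end for
     9      $h_{k+1,k} \leftarrow \|w_k\|_2$
    10      $v_{k+1} \leftarrow w_k / h_{k+1,k}$
    11    end for
    12    $V_m = [v_1, v_2, \ldots, v_m]$; $H_m = \begin{bmatrix} h_{11} & h_{12} & \cdots & h_{1m} \\ h_{21} & h_{22} & \cdots & h_{2m} \\ & \ddots & \ddots & \vdots \\ 0 & \cdots & h_{m,m-1} & h_{mm} \end{bmatrix}$
    13    return ($V_m$, $H_m$, $h_{m+1,m}$, $v_{m+1}$)
    14  end function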


The matrix $H_k$ is zero below the first sub-diagonal ($h_{ij} = 0$ if $i > j + 1$); matrices
with this structure are called Hessenberg matrices (see Section 2.5.3.4). Since $V_k$ is
a matrix of orthonormal columns, $V_k^T V_k = I$, we have

$$V_k^T A V_k = V_k^T (V_k H_k + h_{k+1,k}\, v_{k+1} e_k^T)
= V_k^T V_k H_k + h_{k+1,k}\, V_k^T v_{k+1} e_k^T = H_k$$

as $V_k^T v_{k+1} = 0$. The Arnoldi iteration is shown in Algorithm 29.
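The Arnoldi relations can be checked on a small example. The following Python sketch uses modified Gram-Schmidt and stores the $(m+1) \times m$ array of coefficients $h_{jk}$ with zero-based indices. The test matrix, the helper names, and the early exit on exact breakdown are assumptions made for this illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def arnoldi(A, v1, m):
    """Sketch of the Arnoldi iteration; returns the basis vectors and the
    (m+1) x m Hessenberg array H, with H[j][k] holding h_{j+1,k+1}."""
    nrm = math.sqrt(dot(v1, v1))
    V = [[x / nrm for x in v1]]
    H = [[0.0] * m for _ in range(m + 1)]
    for k in range(m):
        w = matvec(A, V[k])
        for j in range(k + 1):                  # modified Gram-Schmidt sweep
            H[j][k] = dot(w, V[j])
            w = [wi - H[j][k] * vi for wi, vi in zip(w, V[j])]
        H[k + 1][k] = math.sqrt(dot(w, w))
        if H[k + 1][k] == 0.0:                  # exact breakdown: invariant subspace
            break
        V.append([wi / H[k + 1][k] for wi in w])
    return V, H

# Hypothetical nonsymmetric 5 x 5 test matrix
A = [[(i + 1.0) if i == j else 0.3 / (1 + i + 2 * j) for j in range(5)]
     for i in range(5)]
V, H = arnoldi(A, [1.0] * 5, 3)

# Largest deviation from orthonormality of the computed basis
orth_err = max(abs(dot(V[i], V[j]) - (1.0 if i == j else 0.0))
               for i in range(len(V)) for j in range(len(V)))
print("orthonormality error:", orth_err)
```

Each column relation $A v_k = \sum_{\ell \le k+1} h_{\ell k} v_\ell$ can be verified directly from the returned arrays.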

If the matrix $A$ is symmetric, then there are very important computational
advantages. Since $H_k = V_k^T A V_k$, if $A$ is symmetric then $H_k^T = (V_k^T A V_k)^T = V_k^T A^T V_k =
V_k^T A V_k = H_k$ is also symmetric. As $H_k$ is symmetric Hessenberg, $H_k$ is symmetric
tridiagonal. This means (at least in exact arithmetic) that the Gram-Schmidt process
stops after orthogonalizing $w_k$ against $v_k$ and $v_{k-1}$. Using these observations gives
the Lanczos iteration [152], as shown in Algorithm 30.

As with the Arnoldi iteration, the Lanczos iteration has $A V_m = V_m T_m + \beta_m v_{m+1} e_m^T$.

Both the Arnoldi and Lanczos iterations can break down if we get $h_{k+1,k} =
0$ or $\beta_k = 0$. In either of these cases, $w_k = 0$ in line 9 of Algorithm 29 or
line 7 of Algorithm 30 after the Gram-Schmidt process. This means that $A v_k \in
\operatorname{span}\{v_1, v_2, \ldots, v_k\} = \operatorname{span}\{v_1, A v_1, \ldots, A^{k-1} v_1\}$. This means that $\operatorname{range}(V_k) =
\operatorname{span}\{v_1, A v_1, \ldots, A^{k-1} v_1\}$ is an invariant subspace: $A \operatorname{range}(V_k) \subseteq \operatorname{range}(V_k)$.
This is actually good for both eigenvalue problems and solving linear systems:
$\operatorname{span}\{v_1, v_2, \ldots, v_k\}$ must then contain an exact eigenvector. If $v_1 = b$ for solving
$Ax = b$, exact breakdown means that $\operatorname{span}\{v_1, v_2, \ldots, v_k\}$ must contain an exact
solution of $Ax = b$.

The bigger problem for these methods is when $h_{k+1,k} \approx 0$ or $\beta_k \approx 0$. Then
roundoff errors are amplified in the steps $v_{k+1} \leftarrow w_k / h_{k+1,k}$ and $v_{k+1} \leftarrow w_k / \beta_k$. This
results in gradual loss of orthogonality of the vectors $v_1, v_2, \ldots, v_k$. This can be
repaired by re-applying the Gram-Schmidt process to $v_1, v_2, \ldots, v_k$, or by using a
sequential version of the Householder QR factorization (see Section 2.2.2.4).

Algorithm 30 Lanczos iteration

     1  function lanczos($A$, $v_1$, $m$)
     2    $v_1 \leftarrow v_1 / \|v_1\|_2$; $\beta_0 \leftarrow 0$; $v_0 \leftarrow 0$
     3    for $k = 1, 2, \ldots, m$
     4      $w_k \leftarrow A v_k$
     5      $\alpha_k \leftarrow w_k^T v_k$
     6      $w_k \leftarrow w_k - \alpha_k v_k - \beta_{k-1} v_{k-1}$
     7      $\beta_k \leftarrow \|w_k\|_2$
     8      $v_{k+1} \leftarrow w_k / \beta_k$
     9    end for
    10    $V_m = [v_1, v_2, \ldots, v_m]$; $T_m = \begin{bmatrix} \alpha_1 & \beta_1 & & \\ \beta_1 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_{m-1} \\ & & \beta_{m-1} & \alpha_m \end{bmatrix}$
    11    return ($V_m$, $T_m$, $\beta_m$, $v_{m+1}$)
    12  end function
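To make the three-term recurrence of the Lanczos iteration concrete, here is a short Python sketch for a small symmetric matrix. It returns the coefficients $\alpha_k$ and $\beta_k$ together with the basis vectors so that the relation $A v_k = \beta_{k-1} v_{k-1} + \alpha_k v_k + \beta_k v_{k+1}$ can be checked. The test matrix and function names are hypothetical, and the sketch assumes no breakdown ($\beta_k \neq 0$).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def lanczos(A, v1, m):
    """Sketch of the Lanczos iteration for symmetric A; assumes no breakdown.
    alphas[k], betas[k] hold alpha_{k+1}, beta_{k+1}; V[k] holds v_{k+1}."""
    nrm = math.sqrt(dot(v1, v1))
    V = [[x / nrm for x in v1]]
    alphas, betas = [], []
    v_prev, beta_prev = [0.0] * len(v1), 0.0
    for k in range(m):
        w = matvec(A, V[k])
        alpha = dot(w, V[k])
        # orthogonalize against the two most recent basis vectors only
        w = [wi - alpha * vi - beta_prev * pi
             for wi, vi, pi in zip(w, V[k], v_prev)]
        beta = math.sqrt(dot(w, w))
        alphas.append(alpha)
        betas.append(beta)
        v_prev, beta_prev = V[k], beta
        V.append([wi / beta for wi in w])
    return V, alphas, betas

# Hypothetical symmetric 6 x 6 test matrix
n = 6
A = [[(i + 1.0) if i == j else 0.3 / (1 + i + j) for j in range(n)]
     for i in range(n)]
V, alphas, betas = lanczos(A, [1.0] * n, 4)
orth_err = max(abs(dot(V[i], V[j]) - (1.0 if i == j else 0.0))
               for i in range(len(V)) for j in range(len(V)))
print("alphas:", alphas)
print("betas:", betas)
print("orthonormality error:", orth_err)
```

Only two previous vectors enter each step, in contrast to the full Gram-Schmidt sweep of the Arnoldi iteration; for longer runs the orthonormality error printed at the end is the quantity that gradually degrades.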
There is one more iteration that we should mention: Lanczos biorthogonalization.
This does not create orthonormal bases or even orthogonal bases. However, it does
create two sets of vectors $\{v_1, v_2, \ldots, v_m\}$ and $\{w_1, w_2, \ldots, w_m\}$, where

$$v_k^T w_\ell = \begin{cases} 1, & \text{if } k = \ell, \\ 0, & \text{if } k \neq \ell. \end{cases} \tag{2.4.16}$$

The method of Lanczos biorthogonalization is equivalent to the Lanczos
iteration (Algorithm 30) if $A$ is symmetric and $v_1 = w_1$, in which case $v_j = w_j$ for
$j = 1, 2, \ldots$. Another aspect that it has in common with the Lanczos iteration is that
it involves short sets of operations, unlike the Arnoldi iteration (Algorithm 29). As a
result, Lanczos biorthogonalization only requires $O(n)$ memory. Lanczos
biorthogonalization is shown in Algorithm 31.

Lanczos biorthogonalization is the basis of the QMR method (see Section 2.4.3.4).
The choice of $\gamma_j$ on line 7 is somewhat arbitrary. The important point is that $\beta_j \gamma_j =
\hat{v}_{j+1}^T \hat{w}_{j+1}$, so as to ensure the biorthogonality property (2.4.16). Note that the $\alpha_j$'s,
$\beta_j$'s and $\gamma_j$'s form a non-symmetric tridiagonal matrix

$$T_m = \begin{bmatrix}
\alpha_1 & \beta_1 & & & \\
\gamma_1 & \alpha_2 & \beta_2 & & \\
 & \gamma_2 & \alpha_3 & \ddots & \\
 & & \ddots & \ddots & \beta_{m-1} \\
 & & & \gamma_{m-1} & \alpha_m
\end{bmatrix}.$$
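Algorithm 31 Lanczos biorthogonalization

     1  function lanczosbiorthog($A$, $A^T$, $v_1$, $w_1$, $m$)
     2    $v_0 \leftarrow 0$; $w_0 \leftarrow 0$; $\beta_0 \leftarrow 0$; $\gamma_0 \leftarrow 0$
     3    for $j = 1, 2, \ldots, m$
     4      $\alpha_j \leftarrow w_j^T A v_j$
     5      $\hat{v}_{j+1} \leftarrow A v_j - \alpha_j v_j - \beta_{j-1} v_{j-1}$
     6      $\hat{w}_{j+1} \leftarrow A^T w_j - \alpha_j w_j - \gamma_{j-1} w_{j-1}$
     7      $\gamma_j \leftarrow |\hat{v}_{j+1}^T \hat{w}_{j+1}|^{1/2}$
     8      if $\gamma_j = 0$: return end if // failure
     9      $\beta_j \leftarrow (\hat{v}_{j+1}^T \hat{w}_{j+1}) / \gamma_j$
    10      $v_{j+1} \leftarrow \hat{v}_{j+1} / \gamma_j$
    11      $w_{j+1} \leftarrow \hat{w}_{j+1} / \beta_j$
    12    end for
    13    return $(\alpha_1, \ldots, \alpha_m)$, $(\beta_1, \ldots, \beta_{m-1})$, $(\gamma_1, \ldots, \gamma_{m-1})$,
             $(v_1, \ldots, v_m)$, $(w_1, \ldots, w_m)$
    14  end function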


The main properties for Algorithm 31 are

$$A V_m = V_m T_m + \gamma_m v_{m+1} e_m^T,$$
$$A^T W_m = W_m T_m^T + \beta_m w_{m+1} e_m^T,$$
$$W_m^T A V_m = T_m.$$

Lanczos biorthogonalization can break down with $\hat{v}_{j+1}^T \hat{w}_{j+1} = 0$. If this happens with
either $\hat{v}_{j+1} = 0$ or $\hat{w}_{j+1} = 0$, then we can happily terminate the algorithm as we have at
least one invariant subspace. However, if $\hat{v}_{j+1}^T \hat{w}_{j+1} = 0$ and both $\hat{v}_{j+1} \neq 0$ and $\hat{w}_{j+1} \neq 0$ we
have a serious breakdown of the method. There are ways of repairing this problem
using what are known as look-ahead Lanczos algorithms [198].


2.4.3.2 GMRES 


The Generalized Minimum RESidual (GMRES) method [224] is based on the
Arnoldi iteration. The basic idea is to minimize $\|Av - b\|_2$ over all
$v \in \operatorname{range}(V_m) = \operatorname{span}\{b, Ab, \ldots, A^{m-1} b\}$,
the Krylov subspace generated by $b$. We can write $v = V_m y$ for some $y \in \mathbb{R}^m$. Since
$A V_m = V_m H_m + h_{m+1,m}\, v_{m+1} e_m^T$,

$$A v = A V_m y = V_m H_m y + h_{m+1,m}\, v_{m+1} e_m^T y
= [V_m, v_{m+1}] \begin{bmatrix} H_m \\ h_{m+1,m}\, e_m^T \end{bmatrix} y
= V_{m+1} \widetilde{H}_m y \quad \text{where} \quad
\widetilde{H}_m = \begin{bmatrix} H_m \\ h_{m+1,m}\, e_m^T \end{bmatrix}.$$

Note that $v_1$ (the first column of $V_m$) is $b / \|b\|_2$, so $b = V_{m+1} (\|b\|_2\, e_1)$. Since
$V_{m+1} = [V_m, v_{m+1}]$ has orthonormal columns, $\|[V_m, v_{m+1}] z\|_2 = \|z\|_2$. We therefore wish
to minimize

$$\bigl\| \widetilde{H}_m y - \|b\|_2\, e_1 \bigr\|_2 \quad \text{over } y \in \mathbb{R}^m.$$

This can be accomplished using the QR factorization (see Section 2.2.2). In fact, the
QR factorization can be computed especially efficiently because $\widetilde{H}_m$ is a Hessenberg
matrix by using Givens' rotations (see Section 2.2.2.5).

For the error analysis of the GMRES algorithm, we note that $v = V_m y$ minimizes
$\|Av - b\|_2$ over the Krylov subspace generated by $b$ and so we can apply the perturbation theorem for linear
systems (Theorem 2.1) to bound the error in the solution in terms of the condition
number $\kappa_2(A)$. This condition number can be estimated by the least squares condition
number $\kappa_2(\widetilde{H}_m)$ (see (2.2.5)).
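The steps above (Arnoldi basis, small Hessenberg least squares problem, Givens rotations) can be sketched in Python as follows. This is a minimal illustration with zero starting guess and no restarts; the test matrix, the function name `gmres`, and the assumption that no breakdown occurs before step $m$ are choices made for this example only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def gmres(A, b, m):
    """m GMRES steps with x_0 = 0. Returns x = V_m y and |g[m]|, the least
    squares residual norm. Assumes no breakdown before step m."""
    n = len(b)
    beta = math.sqrt(dot(b, b))
    V = [[bi / beta for bi in b]]
    H = [[0.0] * m for _ in range(m + 1)]
    g = [beta] + [0.0] * m                     # right-hand side ||b||_2 e_1
    cs, sn = [], []
    for k in range(m):
        w = matvec(A, V[k])                    # Arnoldi step
        for j in range(k + 1):
            H[j][k] = dot(w, V[j])
            w = [wi - H[j][k] * vj for wi, vj in zip(w, V[j])]
        H[k + 1][k] = math.sqrt(dot(w, w))
        if H[k + 1][k] > 0.0:
            V.append([wi / H[k + 1][k] for wi in w])
        for j in range(k):                     # apply earlier rotations to column k
            t = cs[j] * H[j][k] + sn[j] * H[j + 1][k]
            H[j + 1][k] = -sn[j] * H[j][k] + cs[j] * H[j + 1][k]
            H[j][k] = t
        rho = math.hypot(H[k][k], H[k + 1][k]) # new rotation zeroing H[k+1][k]
        c, s = H[k][k] / rho, H[k + 1][k] / rho
        cs.append(c)
        sn.append(s)
        H[k][k], H[k + 1][k] = rho, 0.0
        g[k + 1] = -s * g[k]
        g[k] = c * g[k]
    y = [0.0] * m                              # back substitution with the R factor
    for i in range(m - 1, -1, -1):
        y[i] = (g[i] - sum(H[i][j] * y[j] for j in range(i + 1, m))) / H[i][i]
    x = [sum(y[k] * V[k][i] for k in range(m)) for i in range(n)]
    return x, abs(g[m])

def resnorm(A, x, b):
    return math.sqrt(sum((a - bi) ** 2 for a, bi in zip(matvec(A, x), b)))

# Hypothetical nonsymmetric, diagonally dominant 5 x 5 test matrix
n = 5
A = [[4.0 if i == j else 1.0 if j == i + 1 else -2.0 if j == i - 1 else 0.0
      for j in range(n)] for i in range(n)]
A[0][4] = 0.5
b = [1.0, 2.0, 3.0, 4.0, 5.0]
for m in (2, 3, 5):
    x, est = gmres(A, b, m)
    print(m, est, resnorm(A, x, b))
```

The last entry of the rotated right-hand side gives the residual norm without forming $Ax - b$, which is why the estimate and the directly computed residual printed in each row should agree.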


2.4.3.3 Least Squares Based Methods 


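Since conjugate gradients can be applied to symmetric positive-definite linear
systems, they can be applied to the normal equations (2.2.3) $A^T A x = A^T b$ for least
squares problems $\min_x \|Ax - b\|_2$. However, the rate of convergence is controlled by
$\kappa_2(A^T A) = \kappa_2(A)^2$, which means that the number of iterations is $O(\kappa_2(A) \log(1/\epsilon))$
for a relative error of $\epsilon$, instead of $O(\kappa_2(A)^{1/2} \log(1/\epsilon))$ as we would get if $A$ were
symmetric positive definite. To avoid this squaring of the condition number, one
approach is to apply the Lanczos iteration (Algorithm 30) to the symmetric but
indefinite matrix

$$B = \begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix} \quad \text{with starting vector} \quad \begin{bmatrix} u_1 \\ 0 \end{bmatrix}.$$

Assuming that $\|u_1\|_2 = 1$, this gives a sequence of vectors forming an orthonormal
basis of the form

$$v_1 = \begin{bmatrix} u_1 \\ 0 \end{bmatrix}, \quad
v_2 = \begin{bmatrix} 0 \\ w_1 \end{bmatrix}, \quad
v_3 = \begin{bmatrix} u_2 \\ 0 \end{bmatrix}, \quad
v_4 = \begin{bmatrix} 0 \\ w_2 \end{bmatrix}, \quad \ldots$$

This means that in the Lanczos iteration $\alpha_j = v_j^T B v_j = 0$ for all $j$, and that
$\beta_{2j-1} = v_{2j}^T B v_{2j-1} = w_j^T A^T u_j = u_j^T A w_j$ while $\beta_{2j} = v_{2j+1}^T B v_{2j} = u_{j+1}^T A w_j$.
We write $\gamma_j = \beta_{2j}$ and $\delta_j = \beta_{2j-1}$. Then the tridiagonal matrix after $2m - 1$ steps is

$$T_{2m-1} = \begin{bmatrix}
0 & \delta_1 & & & & \\
\delta_1 & 0 & \gamma_1 & & & \\
 & \gamma_1 & 0 & \delta_2 & & \\
 & & \delta_2 & 0 & \ddots & \\
 & & & \ddots & \ddots & \gamma_{m-1} \\
 & & & & \gamma_{m-1} & 0
\end{bmatrix}.$$

Permuting the rows and columns to put the odd numbered rows and columns before
the even numbered rows and columns, represented by permutation matrix $P_m$, gives

$$P_m^T T_{2m-1} P_m = \begin{bmatrix} 0 & B_m \\ B_m^T & 0 \end{bmatrix} \quad \text{where} \quad
B_m = \begin{bmatrix}
\delta_1 & & & \\
\gamma_1 & \delta_2 & & \\
 & \ddots & \ddots & \\
 & & \gamma_{m-2} & \delta_{m-1} \\
 & & & \gamma_{m-1}
\end{bmatrix}. \tag{2.4.17}$$

This algorithm is called Lanczos bidiagonalization. If $V_{2m-1} = [v_1, v_2, \ldots, v_{2m-1}]$ then

$$V_{2m-1} P_m = \begin{bmatrix} U_m & 0 \\ 0 & W_{m-1} \end{bmatrix}$$

where $U_m = [u_1, \ldots, u_m]$ and $W_{m-1} = [w_1, \ldots, w_{m-1}]$, which are matrices with
orthonormal columns. Using the relationship
$B V_{2m-1} = V_{2m-1} T_{2m-1} + \beta_{2m-1}\, v_{2m} e_{2m-1}^T$ we get

$$A^T U_m = W_{m-1} B_m^T + \delta_m w_m e_m^T \quad \text{and}$$
$$A W_{m-1} = U_m B_m.$$

Note that $B_m$ is $m \times (m - 1)$. The LSQR algorithm of Paige and
Saunders [195] uses Lanczos bidiagonalization with $u_1 = b / \|b\|_2$ to obtain the least squares problem
$\min_y \| B_m y - \|b\|_2\, e_1 \|_2$, which is solved by means of a QR factorization using
Givens' rotations and gives the approximate solution of the least squares problem
$\min_x \|Ax - b\|_2$ as $x = W_{m-1} y$.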


2.4.3.4 Quasi-Minimal Residual Method 
The Quasi-Minimal Residual (QMR) method [98] is based on Lanczos
biorthogonalization (Algorithm 31) [225, Alg. 7.4, pp. 212-214]. The tridiagonal matrix $T_m$
produced by the method is used to create a least squares problem

$$\min_y \left\| \begin{bmatrix} T_m \\ \gamma_m e_m^T \end{bmatrix} y - \|b\|_2\, e_1 \right\|_2,$$

and then the solution is $x = V_m y$. The main trick is to build up $y$ and $x$ as the
Lanczos biorthogonalization proceeds, so that it is not necessary to store the $v_j$ and
$w_j$ vectors. The implementation uses a QR factorization of the
$\begin{bmatrix} T_m \\ \gamma_m e_m^T \end{bmatrix}$ matrix using
Givens' rotations. If the $j$th Givens rotation uses the $(c_j = \cos\theta_j,\ s_j = \sin\theta_j)$ pair
to represent the $2 \times 2$ rotation matrix, Saad [225] shows that the computed solution
$x_m$ has residual bounded by

$$\|A x_m - b\|_2 \le \|V_{m+1}\|_2 \Bigl( \prod_{j=1}^{m} |s_j| \Bigr) \|b\|_2.$$

The value of $\|V_{m+1}\|_2$ can be estimated during computation by using $\|V_{m+1}\|_2 \le \|V_{m+1}\|_F$
where $\|\cdot\|_F$ is the Frobenius norm.

Other non-symmetric Krylov and related methods have been developed. Of note
are the Conjugate Gradient Squared (CGS) method of Sonneveld [236] and a
transpose-free QMR method [97].

As always with these Krylov subspace-type solvers, creating good preconditioners
is enormously beneficial. And almost always, finding them is a problem-dependent
task.


Exercises. 


(1) The standard discretization of the Laplacian operator on a region $\Omega \subset \mathbb{R}^2$ on a
grid of points $(x_i, y_j) \in \Omega$ with $x_i = x_0 + ih$ and $y_j = y_0 + jh$ is given by

$$-\nabla^2 u(x_i, y_j) \approx \frac{4 u_{ij} - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1}}{h^2},$$

where $u_{ij} \approx u(x_i, y_j)$. To be specific, we consider the discretization of
the problem where $\Omega = \{ (x, y) \mid x^2 + y^2 < 1 \}$ with boundary conditions
$u(x, y) = g(x, y)$ on $\partial\Omega$. We deal with the boundary conditions by setting
$u_{ij} = g(x_i, y_j)$ if $(x_i, y_j) \notin \Omega$.

Let $A_h u_h = f_h$ be the linear system representing the discretization of the
Poisson equation

$$-\nabla^2 u = f(x, y) \quad \text{in } \Omega,$$
$$u(x, y) = 0 \quad \text{for } (x, y) \in \partial\Omega,$$

where $u_h$ is the vector consisting of $u_{ij}$ for $(x_i, y_j) \in \Omega$.

Use conjugate gradients to solve $A_h u_h = f_h$ for $h = 1/N$ with $N =
5, 10, 20, 40, 100$. Use the stopping criterion $\|A_h u_h - f_h\|_2 / \|f_h\|_2 < \epsilon$ for
a suitable value of €. To be specific, put ¢ = 10~>. Report how the number of 
iterations needed to achieve this stopping criterion changes with NV. 

(2) Solve the equations in Exercise 1 using GMRES for $N = 10, 100$. Compare
the rate of convergence with the conjugate gradient method. Since convergence
tends to be exponential in the number of iterations, it can be useful to plot the
scaled residual norm $\|A_h u_{h,n} - f_h\|_2 / \|f_h\|_2$ against the iteration count $n$
where $u_{h,n}$ is the $n$th candidate solution from the method.

(3) Solve the equations in Exercise 1 using the Jacobi and Gauss-Seidel iterations
for $N = 10, 100$. Compare the rate of convergence with the conjugate gradient
method.

(4) The Symmetrized SOR or SSOR iteration is a method for solving $Ax = b$ for a
symmetric positive-definite matrix $A$. Write $A = D - L - U$ where $D$ is the
diagonal part of $A$ and $L = U^T$ is strictly lower triangular. SSOR is a two-step
method based on the SOR iteration (2.4.4), first going forward through the
matrix, then backward through the matrix:

$$x_{2k+1} = (D + \omega L)^{-1} \bigl( \omega b - [\omega U + (\omega - 1) D]\, x_{2k} \bigr), \tag{2.4.18}$$
$$x_{2k+2} = (D + \omega U)^{-1} \bigl( \omega b - [\omega L + (\omega - 1) D]\, x_{2k+1} \bigr). \tag{2.4.19}$$

Show that the iteration matrix $B_{SSOR}$ where $x_{2k+2} = B_{SSOR}\, x_{2k} + c$ is

$$B_{SSOR} = \omega (2 - \omega) (D + \omega U)^{-1} D (D + \omega L)^{-1}. \tag{2.4.20}$$

Show that $B_{SSOR}$ is symmetric and positive definite provided $0 < \omega < 2$.


(5) Implement the SSOR iteration matrix (2.4.20) as a function

    function ssoritermx($A$, $\omega$, $x$)

    end function

Use this with a suitable value of $\omega$ as a preconditioner for the conjugate gradient
method applied to the problem of Exercise 1.

(6) The Gauss-Seidel, SOR, and SSOR iterations for solving $Ax = b$ are all
essentially sequential if implemented in the obvious way, because solving
$(D + \omega L) z = y$ for $z$ using forward substitution is inherently sequential. An
approach to speed things up comes from graph theory. The idea is to color
the vertices of the graph of $A$ so that any pair of adjacent vertices have
different colors. Show that if $G = (V, E)$, the graph of $A$ ($n \times n$), has a
coloring $c \colon V \to \{1, 2, \ldots, m\}$ like this ($i \sim j$ implies $c(i) \neq c(j)$), then solving
$(D + \omega L) z = y$ for $z$ can be performed in $m$ parallel steps.

(7) Show that the graph of $A$ in Exercise 1, being a grid graph, can be colored by
two colors so that no two adjacent vertices have the same color.

(8) Polynomial preconditioners for a matrix $A$ ($n \times n$) have the form $B =
p(A)$ where $p$ is a suitably chosen polynomial. These can be
desirable over conjugate gradients for parallel computation since they do not
require the computation of inner products $u^T v$, which requires global
communication. Show that the spectral radius of $I - BA$ is $\rho(I - BA) =
\max \{ |1 - \lambda p(\lambda)| \mid \lambda \text{ is an eigenvalue of } A \}$. If $A$ is also symmetric, show that
$\|I - BA\|_2 = \rho(I - BA)$.
(9) If the eigenvalues $\lambda$ of $A$ are all real and lie in the interval $[\alpha, \beta]$ with $0 < \alpha$,
then we can use

$$q(\lambda) = \frac{T_{m+1}\bigl( 2 \frac{\lambda - \alpha}{\beta - \alpha} - 1 \bigr)}{T_{m+1}\bigl( 2 \frac{-\alpha}{\beta - \alpha} - 1 \bigr)}, \qquad
p(\lambda) = \frac{1 - q(\lambda)}{\lambda},$$

where $T_{m+1}(s)$ is the Chebyshev polynomial of degree $m + 1$ (see (4.6.3) of
Section 4.6.2). Check that $p(\lambda)$ is actually a polynomial. Obtain a bound on
the spectral radius of $I - BA$ in terms of $\beta/\alpha$. This bound must be strictly less
than one to be useful! Note that if $A$ is also symmetric, then $\beta/\alpha \ge \kappa_2(A)$.

(10) The LSQR algorithm [195] is an iterative method for solving least squares
problems. Implement it if you do not have an implementation in your favorite
language. For $m = 1000$ and $n = 100$, create an $m \times n$ matrix $A$ with entries
sampled from a standard normal distribution. Also create two $b \in \mathbb{R}^m$
vectors: (1) $b_1$ has entries sampled from a standard normal distribution; (2) for
$b_2$ first create $\hat{x} \in \mathbb{R}^n$ with entries sampled from a standard normal
distribution and set $b_2 = A\hat{x}$. For each $b$ vector plot $\|A x_k - b\|_2 / \|b\|_2$ and
$\|A^T (A x_k - b)\|_2 / (\|A\|_2 \|b\|_2)$ against iteration count $k$ for the LSQR
algorithm. Note that for $b_1$ with very high probability $\|A x^* - b_1\|_2 / \|b_1\|_2 \approx 1$,
while for $b_2$, $\|A x^* - b_2\|_2 = 0$.


2.5 Eigenvalues and Eigenvectors 


The eigenvalue/eigenvector problem is: Given a square matrix $A$, find $\lambda$ and $v$ where

$$Av = \lambda v, \qquad v \neq 0.$$


Computing eigenvalues and eigenvectors forms the third part of numerical linear 
algebra. This is also the foundation for methods for computing the Singular Value 
Decomposition (SVD). Eigenvalues and eigenvectors, in general, are valuable for 
understanding, for example, dynamical systems and stability, while computing eigen- 
values of symmetric matrices is important in optimization and statistics. 


2.5.1 The Power Method & Google 


The power method is the simplest method for computing eigenvalues and eigenvectors.
It is shown in Algorithm 32.

Algorithm 32 Power method

function eigenpower($A$, $x_0$, $\epsilon$)
    $x_0 \leftarrow x_0/\|x_0\|$; $k \leftarrow 0$; $\lambda_0 \leftarrow 0$
    while $\|A x_k - \lambda_k x_k\| > \epsilon \|x_k\|$
        $y_k \leftarrow A x_k$
        $\lambda_{k+1} \leftarrow \mathrm{eigest}(x_k, y_k)$
        $x_{k+1} \leftarrow y_k/\|y_k\|$
        $k \leftarrow k + 1$
    end while
    return $(x_k, \lambda_k)$
end function


The function eigest used in Algorithm 32 can be chosen in different ways.
One of the simplest is to use the ratio $(y_k)_j/(x_k)_j$; if $Ax_k = \lambda x_k$, then
$(y_k)_j/(x_k)_j = \lambda$ for any choice of $j$ with $(x_k)_j \neq 0$. To avoid amplification
of errors, we use the largest denominator:

\[ \mathrm{eigest}(x, y) = \frac{y_j}{x_j} \quad\text{where } |x_j| = \max_\ell |x_\ell|. \tag{2.5.1} \]

An alternative that is most useful for real symmetric or complex Hermitian matrices
is

\[ \mathrm{eigest}(x, y) = \frac{\overline{x}^T y}{\overline{x}^T x}. \tag{2.5.2} \]
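Algorithm 32 and the two eigenvalue estimates can be sketched in plain Python as follows. This is only an illustration under stated assumptions: matrices are stored as lists of rows, the 2-norm is used, the `max_iter` safeguard is an addition not present in Algorithm 32, and the small test matrix is a hypothetical example rather than the matrix used in the text.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product with A stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def norm2(x):
    return math.sqrt(sum(abs(xi) ** 2 for xi in x))

def eigest_ratio(x, y):
    """Estimate (2.5.1): y_j / x_j at the index j where |x_j| is largest."""
    j = max(range(len(x)), key=lambda i: abs(x[i]))
    return y[j] / x[j]

def eigest_rayleigh(x, y):
    """Estimate (2.5.2): conj(x)^T y / conj(x)^T x."""
    num = sum(xi.conjugate() * yi for xi, yi in zip(x, y))
    den = sum(abs(xi) ** 2 for xi in x)
    return num / den

def eigenpower(A, x0, eps, eigest=eigest_ratio, max_iter=10000):
    nrm = norm2(x0)
    x = [xi / nrm for xi in x0]
    lam = 0.0
    for _ in range(max_iter):  # max_iter is a safeguard, not part of Algorithm 32
        y = matvec(A, x)
        resid = norm2([yi - lam * xi for yi, xi in zip(y, x)])
        if resid <= eps * norm2(x):
            break
        lam = eigest(x, y)
        nrm = norm2(y)
        x = [yi / nrm for yi in y]
    return x, lam

# Hypothetical 2 x 2 symmetric test matrix with eigenvalues (5 +/- sqrt(5)) / 2.
A = [[2.0, 1.0], [1.0, 3.0]]
x, lam = eigenpower(A, [1.0, 1.0], 1e-10)
print(lam)  # approximates the dominant eigenvalue (5 + sqrt(5)) / 2
```

Passing `eigest=eigest_rayleigh` switches to the estimate (2.5.2), which is the natural choice for this symmetric example.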
Example 2.21 As an example we will use

\[ A = \begin{bmatrix} +1 & +2 & 0 & 0 & -1 \\ -3 & +1 & -1 & 0 & +1 \\ -1 & -1 & +2 & +1 & +1 \\ -3 & +3 & +2 & +1 & -1 \\ -2 & -3 & -2 & +2 & -2 \end{bmatrix}, \qquad x_0 = \begin{bmatrix} 1 \\ 1 \\ -1 \\ 1 \\ 1 \end{bmatrix}. \tag{2.5.3} \]

We measure the error by the ratio $\|Ax - \lambda x\|_2 / \|x\|_2$. Note that if $z = Ax - \lambda x$,
then $A - z\,\overline{x}^T / \|x\|_2^2$ has $\lambda$ as an exact eigenvalue and $x$ as an exact eigenvector. The
perturbation $E = z\,\overline{x}^T / \|x\|_2^2$ has 2-norm $\|E\|_2 = \|z\|_2 / \|x\|_2$.

After 50 iterations, the eigenvalue estimate was $\approx 3.2654125$, which differed from
the MATLAB-computed maximum eigenvalue by $\approx 3.48 \times 10^{-5}$. The progress of
the convergence is shown in Figure 2.5.1. The dashed line in Figure 2.5.1 is given
by $4 \times 0.805^k$.

Fig. 2.5.1 Convergence of an example of the power method


2.5.1.1 Google PageRank 


An example of the practical use of the power method is the original Google PageR- 
ank algorithm to rank web pages matching a given search criterion. This algorithm 
is based on a Markov chain (see Section 7.4.2.1). The Markov chain behind the 
PageRank algorithm models an indiscriminate web surfer who randomly picks links 
to follow from a given web page with equal probability. If there are no links, simply 
pick any of the very large number of web pages in existence. Web pages with high 
equilibrium probability of being viewed in this Markov chain are taken as being more 
important: there must be more links to these web pages, and many of these links are 
from other web pages that are likely to be important. Thus, the equilibrium proba- 
bilities give a basis for ranking web pages without having to somehow determine the 
quality or meaning of their content. The Google PageRank algorithm modifies this
Markov chain by having a probability $\alpha > 0$ of simply picking a web page uniformly
from all of the possible web pages.

If $p_t$ is the probability vector with $(p_t)_i$ the probability that web page $i$ is being
viewed after $t$ steps, then

\[ p_{t+1} = P\, p_t, \]

where $P$ is the matrix of transition probabilities: $P_{ij}$ is the probability that web
page $i$ is viewed at time $t+1$ given that web page $j$ is viewed at time $t$. These values
are all between zero and one. Furthermore, the total probability must remain one:
$\sum_{i=1}^N (p_t)_i = 1$, that is, $e^T p_t = 1$ for all $t$, where $e = [1, 1, 1, \ldots, 1]^T$ is the vector
of all ones. Then

\[ e^T p_{t+1} = e^T P\, p_t = e^T p_t \quad\text{for all probability vectors } p_t. \]

That is, $e^T P = e^T$.

If $P_0$ is the matrix of transition probabilities for picking links described above, the
PageRank algorithm modifies this by incorporating a probability $\alpha > 0$ of simply
picking a random web page instead of picking from the links in the current page.
The matrix of transition probabilities then becomes

\[ P = \alpha\, e e^T / N + (1 - \alpha) P_0. \tag{2.5.4} \]

This means that each entry $P_{ij} \geq \alpha/N$.

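The set of probability vectors is $\Sigma = \{\, p \mid e^T p = 1,\ p \geq 0 \,\}$ where the inequality
“$\geq$” is understood componentwise. This set is a convex, closed, and bounded set in $\mathbb{R}^N$ where
$N$ is the number of web pages. The matrix defines a function $p \mapsto P p$ that maps
$\Sigma \to \Sigma$. Brouwer's fixed-point theorem [95] shows that there is a fixed point $p^*$
where $P p^* = p^*$. Then $p^*$ is an eigenvector of $P$ with eigenvalue one. Furthermore,
there is only one probability vector with this property since every entry in $P$ is strictly
positive: for probability vectors $p$ and $q$,

\[ \|P(p - q)\|_1 = \sum_{i=1}^N \Big| \sum_{j=1}^N P_{ij}(p_j - q_j) \Big| \leq \sum_{i=1}^N \sum_{j=1}^N P_{ij}\,|p_j - q_j| = \sum_{j=1}^N \Big( \sum_{i=1}^N P_{ij} \Big) |p_j - q_j| = \sum_{j=1}^N |p_j - q_j| = \|p - q\|_1. \]

If $p$ and $q$ are both fixed points, then $\|P(p - q)\|_1 = \|p - q\|_1$, so equality must hold.
Equality can only occur if $\big|\sum_{j=1}^N P_{ij}(p_j - q_j)\big| = \sum_{j=1}^N P_{ij}\,|p_j - q_j|$ for all $i$. Since
all $P_{ij} > 0$, this would mean that $p_j - q_j$ have the same sign for all $j$. Since
$\sum_{j=1}^N p_j = \sum_{j=1}^N q_j = 1$ for probability vectors, this means that $p_j - q_j = 0$ for
all $j$. Thus, the equilibrium probability vector $p^*$ is unique.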

We can apply the power method

\[ p_{t+1} = P\, p_t \]

to this matrix. Since $P$ is the sum of a sparse matrix (the matrix of links between web
pages) plus a low-rank matrix $\alpha\, e e^T / N$ plus the matrix with column $e/N$ for every
web page with no out-going links, the matrix-vector product $P p$ can be computed
efficiently. The number of floating point operations needed is $O(N + L)$ for each
step, where $N$ is the number of web pages and $L$ is the number of links, rather than
$O(N^2)$.
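The $O(N + L)$ product described above can be sketched as follows. This is an illustrative sketch, not Google's implementation: the adjacency-list format `links[j]` (the pages that page `j` links to), the function names, and the stopping rule based on the 1-norm of successive differences are all assumptions made for the example.

```python
def pagerank_step(links, p, alpha):
    """One step p_{t+1} = P p_t with P = alpha e e^T / N + (1 - alpha) P0.

    links[j] lists the pages that page j links to; a page with no
    out-going links contributes the column e / N to P0.
    """
    N = len(p)
    spread = [0.0] * N
    dangling = 0.0
    for j, outs in enumerate(links):
        if outs:
            share = p[j] / len(outs)      # equal probability for each link
            for i in outs:
                spread[i] += share
        else:
            dangling += p[j]              # no links: jump to a uniformly random page
    base = alpha * sum(p) / N + (1.0 - alpha) * dangling / N
    return [base + (1.0 - alpha) * v for v in spread]

def pagerank(links, alpha=0.15, tol=1e-12, max_iter=1000):
    """Power method for the equilibrium probability vector p*."""
    N = len(links)
    p = [1.0 / N] * N
    for _ in range(max_iter):
        q = pagerank_step(links, p, alpha)
        if sum(abs(a - b) for a, b in zip(p, q)) <= tol:
            return q
        p = q
    return p

# Hypothetical four-page web: page 3 has no out-going links.
links = [[1, 2], [2], [0], []]
print(pagerank(links))
```

Each step touches every page once and every link once, so the cost per step is proportional to $N + L$; the dense matrix $P$ is never formed.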

To see how quickly the method converges for the transition probability matrix
(2.5.4),

\[
\begin{aligned}
\|p_{t+1} - p^*\|_1 &= \|P p_t - P p^*\|_1 \\
&= \Big\| \frac{\alpha}{N}\, e e^T (p_t - p^*) + (1 - \alpha) P_0 (p_t - p^*) \Big\|_1 \\
&= \Big\| \frac{\alpha}{N}\, e\,(e^T p_t - e^T p^*) + (1 - \alpha) P_0 (p_t - p^*) \Big\|_1 \\
&= \|(1 - \alpha) P_0 (p_t - p^*)\|_1 \qquad (\text{since } e^T p_t = e^T p^* = 1) \\
&\leq (1 - \alpha)\, \|p_t - p^*\|_1 .
\end{aligned}
\]

So the power method in this case converges, and converges at a rate controlled by $1 - \alpha$.
The number of iterations needed to give a specified accuracy $\epsilon$ is $O(\log(\epsilon)/\log(1 - \alpha))$.
Google reportedly uses $\alpha \approx 0.15$, so that they can guarantee a certain level of
accuracy with a modest number of iterations.
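As a small worked instance of the iteration estimate, the following sketch computes the smallest $k$ with $(1 - \alpha)^k \leq \epsilon$. The function name and the choice $\epsilon = 10^{-6}$ are illustrative assumptions; the bound concerns the reduction factor of the 1-norm error, not the absolute error.

```python
import math

def pagerank_iterations(eps, alpha):
    """Smallest k with (1 - alpha)**k <= eps, from k >= log(eps) / log(1 - alpha)."""
    return math.ceil(math.log(eps) / math.log(1.0 - alpha))

k = pagerank_iterations(1e-6, 0.15)
print(k)  # 86
```

So with $\alpha = 0.15$, fewer than a hundred matrix-vector products reduce the error bound by a factor of a million, independently of $N$.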


2.5.1.2 Convergence of the Power Method 


While Google’s PageRank method is guaranteed convergence because of its special 
structure, we need to investigate the convergence of the power method in general. 
We first show that in Algorithm 32,

\[ x_k = \frac{A^k x_0}{\|A^k x_0\|}, \tag{2.5.5} \]

where $\|\cdot\|$ is the vector norm used in Algorithm 32. Note that if (2.5.5) holds, then

\[ x_{k+1} = \frac{A x_k}{\|A x_k\|} = \frac{A\,(A^k x_0 / \|A^k x_0\|)}{\|A\,(A^k x_0 / \|A^k x_0\|)\|} = \frac{A^{k+1} x_0 / \|A^k x_0\|}{\|A^{k+1} x_0\| / \|A^k x_0\|} = \frac{A^{k+1} x_0}{\|A^{k+1} x_0\|}, \]

so that (2.5.5) holds for $k$ replaced by $k + 1$. Thus, by induction, (2.5.5) holds for
$k = 0, 1, 2, \ldots$.

We suppose that our $n \times n$ matrix $A$ has a basis of eigenvectors $v_1, v_2, \ldots, v_n$
with $A v_j = \lambda_j v_j$. We assume that

\[ |\lambda_1| > |\lambda_2| \geq |\lambda_3| \geq \cdots \geq |\lambda_n|. \tag{2.5.6} \]

This makes $\lambda_1$ the dominant eigenvalue. Since $\{v_1, v_2, \ldots, v_n\}$ is a basis for $\mathbb{R}^n$ (or
$\mathbb{C}^n$ if appropriate), we can write

\[ x_0 = c_1 v_1 + c_2 v_2 + \cdots + c_n v_n. \]

We further assume that $c_1 \neq 0$. Since $A^k v_j = \lambda_j^k v_j$,

\[ A^k x_0 = c_1 \lambda_1^k v_1 + c_2 \lambda_2^k v_2 + \cdots + c_n \lambda_n^k v_n. \]


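Then

\[ x_k = \frac{A^k x_0}{\|A^k x_0\|} = \frac{c_1 \lambda_1^k v_1 + c_2 \lambda_2^k v_2 + \cdots + c_n \lambda_n^k v_n}{\|c_1 \lambda_1^k v_1 + c_2 \lambda_2^k v_2 + \cdots + c_n \lambda_n^k v_n\|}. \]

Note that

\[ x_k = \frac{c_1 \lambda_1^k}{|c_1 \lambda_1^k|} \cdot \frac{v_1 + \sum_{j=2}^n (c_j/c_1)(\lambda_j/\lambda_1)^k v_j}{\big\| v_1 + \sum_{j=2}^n (c_j/c_1)(\lambda_j/\lambda_1)^k v_j \big\|}. \]

We assumed (2.5.6), and so $|\lambda_j/\lambda_1| < 1$ for $j = 2, 3, \ldots, n$. Then $(\lambda_j/\lambda_1)^k \to 0$ as
$k \to \infty$ for $j \geq 2$. Therefore

\[ \frac{|c_1 \lambda_1^k|}{c_1 \lambda_1^k}\, x_k = \frac{v_1 + \sum_{j=2}^n (c_j/c_1)(\lambda_j/\lambda_1)^k v_j}{\big\| v_1 + \sum_{j=2}^n (c_j/c_1)(\lambda_j/\lambda_1)^k v_j \big\|} \to \frac{v_1}{\|v_1\|} \quad\text{as } k \to \infty. \]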

Note that this result does not claim that $x_k$ converges, just that $\mathrm{span}\{x_k\}$ converges
to $\mathrm{span}\{v_1\}$ in the sense that the angle between these subspaces goes to zero.

A better way to describe the angle $\theta = \angle(\mathrm{span}\{x\}, \mathrm{span}\{y\})$, the angle between
$\mathrm{span}\{x\}$ and $\mathrm{span}\{y\}$, is to note that

\[ \sin\theta = \min_s \frac{\|x - s\,y\|_2}{\|x\|_2} = \left( 1 - \frac{|\overline{y}^T x|^2}{\|x\|_2^2\, \|y\|_2^2} \right)^{1/2} = \left\| \left( I - \frac{y\, \overline{y}^T}{\|y\|_2^2} \right) \frac{x}{\|x\|_2} \right\|_2, \]

the minimizing $s = \overline{y}^T x / \|y\|_2^2$.

We should be concerned with the rate of convergence as well as the fact of convergence.
We note that

\[ \left\| \frac{v_1 + \sum_{j=2}^n (c_j/c_1)(\lambda_j/\lambda_1)^k v_j}{\big\| v_1 + \sum_{j=2}^n (c_j/c_1)(\lambda_j/\lambda_1)^k v_j \big\|} - \frac{v_1}{\|v_1\|} \right\| = O\!\left( \left| \frac{\lambda_2}{\lambda_1} \right|^k \right) \quad\text{as } k \to \infty. \]

This means that if $|\lambda_2| \approx |\lambda_1|$ then convergence can be very slow. For our
computational example (2.5.3), $\lambda_2 \approx 0.90230 + 2.468507i$ and $|\lambda_2/\lambda_1| \approx 0.80488$, which
gives the approximate slope of the residual norm in Figure 2.5.1.
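The angle formula and the predicted rate $|\lambda_2/\lambda_1|$ can be checked together on a small case. The sketch below is an illustration under assumptions: the function names are hypothetical, and the $2 \times 2$ symmetric matrix is chosen only because its eigenvalues $(5 \pm \sqrt{5})/2$ and dominant eigenvector are known in closed form.

```python
import math

def sin_angle(x, y):
    """sin of the angle between span{x} and span{y}: min_s ||x - s y||_2 / ||x||_2."""
    yy = sum(abs(v) ** 2 for v in y)
    s = sum(yi.conjugate() * xi for yi, xi in zip(y, x)) / yy   # minimizing s
    r = [xi - s * yi for xi, yi in zip(x, y)]
    return math.sqrt(sum(abs(v) ** 2 for v in r) / sum(abs(v) ** 2 for v in x))

def power_angles(A, x0, v1, steps):
    """sin(theta_k) between the power iterate x_k and a known eigenvector v1."""
    x, out = list(x0), []
    for _ in range(steps):
        out.append(sin_angle(x, v1))
        y = [sum(a * xj for a, xj in zip(row, x)) for row in A]
        nrm = math.sqrt(sum(abs(v) ** 2 for v in y))
        x = [v / nrm for v in y]
    return out

A = [[2.0, 1.0], [1.0, 3.0]]
lam1, lam2 = (5 + math.sqrt(5)) / 2, (5 - math.sqrt(5)) / 2
v1 = [1.0, lam1 - 2.0]                 # solves (A - lam1 I) v1 = 0
angles = power_angles(A, [1.0, 0.0], v1, 12)
ratios = [b / a for a, b in zip(angles, angles[1:])]
print(ratios[-1], lam2 / lam1)         # the two values nearly agree
```

The successive ratios of $\sin\theta_k$ settle at $|\lambda_2/\lambda_1|$, which is the linear convergence rate derived above.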

Note that if $|\lambda_2| = |\lambda_1|$, then there is typically no convergence at all. This latter
situation is not uncommon: for real matrices, this occurs if $\lambda_1$ is complex, so $\lambda_2 = \overline{\lambda_1}$
and the two largest eigenvalues have the same magnitude. Perhaps this should not
be surprising: if $A$ is real and $x_0$ is real, so are all the iterates $x_k$. We cannot expect
convergence to a complex eigenvector.

Also note that convergence is to the dominant eigenvalue and its eigenvector.
Although we assume $c_1 \neq 0$, this is not a strong assumption. Roundoff error makes
it unlikely that we get $c_1 = 0$, or that this is maintained. Even if we begin with $c_1 \approx 0$,
this will be amplified.

If we wish to find other eigenvalues, or accelerate convergence, or deal with a 
complex conjugate pair of dominant eigenvalues, we need other methods. 


2.5.1.3 Variants: Inverse Iteration 


We can start by giving a target $\mu$ for the eigenvalues: we seek to find the eigenvalue
closest to $\mu$ and its eigenvector. While $A - \mu I$ has eigenvalues $\lambda_j - \mu$, if $\lambda_j \approx \mu$
then $\lambda_j - \mu$ is small, and so components in the direction of $v_j$ will be reduced,
not amplified. So we use $(A - \mu I)^{-1}$, which has eigenvalues $1/(\lambda_j - \mu)$. Now if
$\lambda_j \approx \mu$, $1/(\lambda_j - \mu)$ is large. If $\mu$ is closer to $\lambda_j$ than any other eigenvalue of $A$, then
$1/(\lambda_j - \mu)$ is the dominant eigenvalue of $(A - \mu I)^{-1}$. Applying the power method
to $(A - \mu I)^{-1}$ should give us convergence of the eigenvalue estimates to $1/(\lambda_j - \mu)$
and the eigenvector to $\mathrm{span}\{v_j\}$. If $\hat\lambda$ is the estimate of the eigenvalue of $(A - \mu I)^{-1}$,
then the estimate of $\lambda_j$ should be $\mu + 1/\hat\lambda$. This gives us the inverse power method
in Algorithm 33.

The error then is expected to be $O\big((|\lambda_j - \mu| / |\lambda_\ell - \mu|)^k\big)$ as $k \to \infty$ where $\lambda_\ell$
is the second closest eigenvalue to $\mu$.

For our example (2.5.3), taking $\mu = 0.5 + 2i$ so as to target $\lambda_2$, the convergence
is shown in Figure 2.5.2. The slope of this plot indicates an error $\approx C\, r^k$ with $r \approx 0.3908$,
while $|\lambda_j - \mu| / |\lambda_\ell - \mu| \approx 0.3910$.

But this can be accelerated even more. Since we want $\mu$ to be close to the target
eigenvalue for fast convergence, we can use the estimated eigenvalue to give an
even better approximation to the target eigenvalue. This gives Algorithm 34, which
is called inverse iteration.

Algorithm 33 Inverse (shifted) power method

function inveigpower($A$, $x_0$, $\mu$, $\epsilon$)
    $x_0 \leftarrow x_0/\|x_0\|$; $k \leftarrow 0$; $\lambda_0 \leftarrow 0$
    while $\|A x_k - \lambda_k x_k\| > \epsilon \|x_k\|$
        $y_k \leftarrow (A - \mu I)^{-1} x_k$
        $\lambda_{k+1} \leftarrow \mu + 1/\mathrm{eigest}(x_k, y_k)$
        $x_{k+1} \leftarrow y_k/\|y_k\|$
        $k \leftarrow k + 1$
    end while
    return $(x_k, \lambda_k)$
end function

Algorithm 34 Inverse iteration

function inviter($A$, $x_0$, $\mu_0$, $\epsilon$)
    $x_0 \leftarrow x_0/\|x_0\|$; $k \leftarrow 0$
    while $\|A x_k - \mu_k x_k\| > \epsilon \|x_k\|$
        $y_k \leftarrow (A - \mu_k I)^{-1} x_k$
        $\mu_{k+1} \leftarrow \mu_k + 1/\mathrm{eigest}(x_k, y_k)$
        $x_{k+1} \leftarrow y_k/\|y_k\|$
        $k \leftarrow k + 1$
    end while
    return $(x_k, \mu_k)$
end function

Results for Algorithm 34 are shown for our example (2.5.3) in Figure 2.5.2 using
$\mu_0 = 0.5 + 2i$. As can be seen from Figure 2.5.2, the convergence of inverse iteration
is quite rapid. In fact, it can be shown that for inverse iteration where $\mu_k \to \lambda$ as
$k \to \infty$, we have $|\mu_{k+1} - \lambda| = O(|\mu_k - \lambda|^2)$, showing quadratic convergence. If $A$ is
symmetric and the “symmetric” eigenvalue estimate (2.5.2) is used, inverse iteration
has cubic convergence: $|\mu_{k+1} - \lambda| = O(|\mu_k - \lambda|^3)$ as $k \to \infty$.

Fig. 2.5.2 Convergence of inverse (shifted) power method example
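Inverse iteration with an updated shift can be sketched in plain Python as below. The sketch makes several assumptions that are not in the text: linear systems are solved by a small dense Gaussian elimination routine (`solve`, a hypothetical helper), an exactly zero pivot is nudged to a tiny value so that a nearly singular shifted matrix does not stop the iteration, the estimate (2.5.2) is used, and `max_iter` is an added safeguard.

```python
import math

def norm2(x):
    return math.sqrt(sum(abs(v) ** 2 for v in x))

def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [bi] for row, bi in zip(M, b)]
    scale = max(abs(v) for row in M for v in row)
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[p] = aug[p], aug[k]
        if aug[k][k] == 0:
            aug[k][k] = 1e-16 * scale     # assumption: nudge an exactly zero pivot
        for i in range(k + 1, n):
            f = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= f * aug[k][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (aug[i][n] - sum(aug[i][j] * z[j] for j in range(i + 1, n))) / aug[i][i]
    return z

def eigest_rayleigh(x, y):
    """Estimate (2.5.2): conj(x)^T y / conj(x)^T x."""
    return sum(a.conjugate() * b for a, b in zip(x, y)) / sum(abs(a) ** 2 for a in x)

def inviter(A, x0, mu0, eps, max_iter=50):
    n = len(x0)
    nrm = norm2(x0)
    x = [v / nrm for v in x0]
    mu = mu0
    for _ in range(max_iter):             # safeguard, not part of Algorithm 34
        Ax = [sum(a * v for a, v in zip(row, x)) for row in A]
        if norm2([a - mu * v for a, v in zip(Ax, x)]) <= eps * norm2(x):
            break
        shifted = [[A[i][j] - (mu if i == j else 0.0) for j in range(n)] for i in range(n)]
        y = solve(shifted, x)             # y_k = (A - mu_k I)^{-1} x_k
        mu = mu + 1.0 / eigest_rayleigh(x, y)
        nrm = norm2(y)
        x = [v / nrm for v in y]
    return x, mu

# Hypothetical symmetric example; the eigenvalue nearest mu0 = 1 is (5 - sqrt(5)) / 2.
A = [[2.0, 1.0], [1.0, 3.0]]
x, mu = inviter(A, [1.0, 0.0], 1.0, 1e-10)
print(mu)
```

For this symmetric example only a handful of iterations are needed, in line with the cubic convergence noted above. Keeping the shift fixed at `mu0` instead of updating it gives the inverse power method of Algorithm 33.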


2.5.2 Schur Decomposition 


If a square matrix A does not have a basis of eigenvectors, we often need a way 
to find a complete eigen-decomposition of A. The algebraic answer is the Jordan 
canonical form (JCF): 


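Theorem 2.22 If $A$ is an $n \times n$ real or complex matrix then there is a complex
invertible matrix $X$ where

\[ X^{-1} A X = \begin{bmatrix} J(n_1, \lambda_1) & & & \\ & J(n_2, \lambda_2) & & \\ & & J(n_3, \lambda_3) & \\ & & & \ddots \end{bmatrix} \quad\text{where}\quad J(m, \lambda) = \begin{bmatrix} \lambda & 1 & & \\ & \lambda & \ddots & \\ & & \ddots & 1 \\ & & & \lambda \end{bmatrix} \quad (m \times m). \]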


We do not give a proof of this theorem. This is a difficult theorem to apply in
practice numerically: the matrix $X$ can have arbitrarily bad condition number. Take,
for example,

\[ A_\eta = \begin{bmatrix} 1 & 1 \\ 0 & 1 + \eta \end{bmatrix}. \]

This matrix $A_\eta$ has distinct eigenvalues one and $1 + \eta$ for $\eta \neq 0$. Thus, apart from
scale factors, the eigenvectors are $v_1 = [1, 0]^T$ and $v_2 = [1, \eta]^T$. Let $X_\eta = [v_1, s\,v_2]$
for some scale factor $s \neq 0$. Note that the condition number is independent of the
overall scaling of a matrix: $\kappa(\alpha X_\eta) = \kappa(X_\eta)$ for any $\alpha \neq 0$. Then

\[ X_\eta = \begin{bmatrix} 1 & s \\ 0 & s\eta \end{bmatrix}, \qquad X_\eta^{-1} = \begin{bmatrix} 1 & -1/\eta \\ 0 & 1/(s\eta) \end{bmatrix}, \]

so

\[ \kappa_\infty(X_\eta) = \|X_\eta\|_\infty \|X_\eta^{-1}\|_\infty = \max(1 + |s|,\ |s\eta|)\, \max\!\left( 1 + \frac{1}{|\eta|},\ \frac{1}{|s\eta|} \right) \geq 1 + \frac{1}{|\eta|}. \]

Indeed, a thorough investigation of the numerical issues in numerically computing
the JCF by Demmel in his PhD thesis [76] in 1983 showed that it can be extremely
challenging to numerically compute the JCF of a given matrix.
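The growth of $\kappa_\infty(X_\eta)$ as $\eta \to 0$ can be tabulated directly from the formulas for $X_\eta$ and $X_\eta^{-1}$. The following sketch uses hypothetical function names and assumes the scale factor $s = 1$ unless stated otherwise.

```python
def inf_norm(M):
    """Induced infinity-norm: the largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)

def kappa_inf_eigvecs(eta, s=1.0):
    """kappa_inf(X_eta) for X_eta = [v1, s*v2] with v1 = [1, 0]^T, v2 = [1, eta]^T."""
    X = [[1.0, s], [0.0, s * eta]]
    Xinv = [[1.0, -1.0 / eta], [0.0, 1.0 / (s * eta)]]
    return inf_norm(X) * inf_norm(Xinv)

for eta in (1e-1, 1e-3, 1e-6):
    # second column: computed condition number; third: the lower bound 1 + 1/|eta|
    print(eta, kappa_inf_eigvecs(eta), 1.0 + 1.0 / abs(eta))
```

No choice of the scale factor `s` brings the condition number below $1 + 1/|\eta|$, so the eigenvector matrix becomes arbitrarily ill-conditioned as the two eigenvalues merge.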

Instead, for numerical computation, a much better alternative is the Schur decom- 
position: 


Theorem 2.23 For any square real or complex matrix $A$ there is a unitary matrix
$U$ ($U^{-1} = \overline{U}^T$), where

\[ \overline{U}^T A\, U = T, \qquad T \text{ is upper triangular.} \tag{2.5.7} \]

Note that the diagonal entries of $T$ are the eigenvalues of $A$.


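Proof We prove this by induction on $n$ where $A$ is an $n \times n$ matrix.

Base case: $n = 1$. In this case $A$ is $1 \times 1$ so $A = [a_{11}]$ and we can take $U = [1]$
and $T = [a_{11}]$.

Induction step: Suppose (2.5.7) holds whenever $A$ is $k \times k$; we now wish to show
that (2.5.7) holds whenever $A$ is $(k + 1) \times (k + 1)$.

We start by showing that $A$ has a (real or complex) eigenvalue. The characteristic
polynomial $p_A(z) = \det(A - zI)$ is a polynomial with real or complex coefficients,
and so by the Fundamental Theorem of Algebra [134, p. 309] there is a real or complex
zero of this polynomial: $p_A(\lambda) = 0 = \det(A - \lambda I)$. Then $A - \lambda I$ is not an invertible
matrix, and so there is a real or complex vector $v \neq 0$
where $(A - \lambda I)v = 0$. Note that if $A$ is a real matrix and $\lambda$ is also real, then $v$ can be
chosen to be a real vector. We can normalize $v$ so that $\|v\|_2^2 = \overline{v}^T v = \sum_{j} |v_j|^2 = 1$.

By means of the QR factorization of $v$, there is a unitary matrix $Q = [q, Q_2]$
where $v = [q, Q_2] \begin{bmatrix} r \\ 0 \end{bmatrix} = r\,q$. Since $\|v\|_2 = \|q\|_2 = 1$, $|r| = 1$ and so $Aq = \lambda q$
with $q \neq 0$. Then

\[ \overline{Q}^T A\, Q = \begin{bmatrix} \overline{q}^T \\ \overline{Q_2}^T \end{bmatrix} A\, [q, Q_2] = \begin{bmatrix} \overline{q}^T A q & \overline{q}^T A Q_2 \\ \overline{Q_2}^T A q & \overline{Q_2}^T A Q_2 \end{bmatrix} = \begin{bmatrix} \lambda\, \overline{q}^T q & \overline{q}^T A Q_2 \\ \lambda\, \overline{Q_2}^T q & \overline{Q_2}^T A Q_2 \end{bmatrix} = \begin{bmatrix} \lambda & \overline{q}^T A Q_2 \\ 0 & \overline{Q_2}^T A Q_2 \end{bmatrix} \]

since $\overline{Q_2}^T q = 0$ by orthogonality of the columns of $Q_2$ with $q$. By the induction
hypothesis, there is a $k \times k$ unitary matrix $\widehat{U}$ where $\overline{\widehat{U}}^T\, \overline{Q_2}^T A\, Q_2\, \widehat{U} = \widehat{T}$, an upper
triangular matrix. Putting

\[ U = Q \begin{bmatrix} 1 & 0 \\ 0 & \widehat{U} \end{bmatrix}, \quad\text{which is unitary, we get} \]

\[ \overline{U}^T A\, U = \begin{bmatrix} 1 & 0 \\ 0 & \overline{\widehat{U}}^T \end{bmatrix} \begin{bmatrix} \lambda & \overline{q}^T A Q_2 \\ 0 & \overline{Q_2}^T A Q_2 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & \widehat{U} \end{bmatrix} = \begin{bmatrix} \lambda & \overline{q}^T A Q_2 \widehat{U} \\ 0 & \widehat{T} \end{bmatrix} = T, \quad\text{which is upper triangular,} \]

as we wanted.
Therefore, by the principle of induction, (2.5.7) holds for every $n \times n$ matrix $A$ with
$n = 1, 2, 3, \ldots$.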
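The deflation step in this proof can be carried out by hand for a real $2 \times 2$ matrix with a known real eigenvector. The sketch below is only an illustration of that single step; the function name and the example matrix are assumptions, and the code handles the real $2 \times 2$ case only.

```python
import math

def schur_step_2x2(A, v):
    """Given a real eigenvector v of a real 2 x 2 matrix A, return (Q, T) with
    Q = [q, q_perp] orthogonal and T = Q^T A Q upper triangular."""
    nrm = math.hypot(v[0], v[1])
    q = [v[0] / nrm, v[1] / nrm]
    q_perp = [-q[1], q[0]]                   # unit vector orthogonal to q
    Q = [[q[0], q_perp[0]], [q[1], q_perp[1]]]
    AQ = [[sum(A[i][k] * Q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
    T = [[sum(Q[k][i] * AQ[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
    return Q, T

A = [[4.0, 1.0], [2.0, 3.0]]                 # eigenvalues 5 and 2
Q, T = schur_step_2x2(A, [1.0, 1.0])         # [1, 1]^T is an eigenvector for 5
print(T)  # approximately [[5, -1], [0, 2]]
```

The $(2,1)$ entry of $T$ vanishes because the first column of $Q$ is an eigenvector, and the diagonal of $T$ carries the eigenvalues of $A$, as Theorem 2.23 asserts.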


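It should be noted that the Schur decomposition gives a true eigen-decomposition in
many cases. First, if $A$ is a Hermitian matrix ($\overline{A}^T = A$) or real symmetric ($A^T = A$
and real) then $U$ unitary and

\[ \overline{U}^T A\, U = T \text{ upper triangular, implies} \]
\[ \overline{T}^T = \overline{\big(\overline{U}^T A\, U\big)}^T = \overline{U}^T\, \overline{A}^T\, U = \overline{U}^T A\, U = T. \]

Then $T$ must be diagonal with real diagonal entries.
More generally, if $A$ is normal in the sense that $\overline{A}^T A = A\, \overline{A}^T$, then

\[ \overline{A}^T A = \big(U\, \overline{T}^T\, \overline{U}^T\big)\big(U\, T\, \overline{U}^T\big) = U\, \overline{T}^T T\, \overline{U}^T, \quad\text{while} \]
\[ A\, \overline{A}^T = \big(U\, T\, \overline{U}^T\big)\big(U\, \overline{T}^T\, \overline{U}^T\big) = U\, T\, \overline{T}^T\, \overline{U}^T. \]

Equating these and re-arranging gives $\overline{T}^T T = T\, \overline{T}^T$. We will see that this implies $T$
is diagonal. Write $T = \begin{bmatrix} \tau & t^T \\ 0 & \widehat{T} \end{bmatrix}$. Then $\overline{T}^T T = T\, \overline{T}^T$ implies that

\[ \overline{T}^T T = \begin{bmatrix} \overline{\tau} & 0 \\ \overline{t} & \overline{\widehat{T}}^T \end{bmatrix} \begin{bmatrix} \tau & t^T \\ 0 & \widehat{T} \end{bmatrix} = \begin{bmatrix} |\tau|^2 & \overline{\tau}\, t^T \\ \tau\, \overline{t} & \overline{t}\, t^T + \overline{\widehat{T}}^T \widehat{T} \end{bmatrix} \quad\text{and} \]
\[ T\, \overline{T}^T = \begin{bmatrix} \tau & t^T \\ 0 & \widehat{T} \end{bmatrix} \begin{bmatrix} \overline{\tau} & 0 \\ \overline{t} & \overline{\widehat{T}}^T \end{bmatrix} = \begin{bmatrix} |\tau|^2 + \|t\|_2^2 & t^T\, \overline{\widehat{T}}^T \\ \widehat{T}\, \overline{t} & \widehat{T}\, \overline{\widehat{T}}^T \end{bmatrix} \quad\text{are equal.} \]

Then $\|t\|_2^2 = 0$ and so $t = 0$. Applying the argument inductively to $\widehat{T}$ shows
that $T$ must be diagonal in this case, and we have a complete orthogonal
eigen-decomposition of the matrix $A$.


2.5.2.1 Continuity and Perturbation Theorems for Eigenvalues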

The set of eigenvalues, or spectrum,

\[ \sigma(A) = \{\, \lambda \mid \lambda \text{ is an eigenvalue of } A \,\} \tag{2.5.8} \]

is continuous in the entries of $A$ in the sense that for any $\epsilon > 0$ there is a $\delta = \delta(A, \epsilon) > 0$
such that $\|E\|_\infty < \delta$ implies that every $\lambda \in \sigma(A)$ is within distance $\epsilon$ of
some $\mu \in \sigma(A + E)$, and every $\mu \in \sigma(A + E)$ is within distance $\epsilon$ of some $\lambda \in \sigma(A)$.


Theorem 2.24 For every n, the spectrum o(A) of n x n matrices is continuous in 
the sense described above. 


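Proof (Outline) This proof requires elements of complex analysis (see, for example,
[2]). Suppose $f$ is a function that is analytic inside and on a simple non-self-intersecting
curve $C$ going once counter-clockwise around a region $R$ (with $C$ the boundary of
$R$); then the number of zeros of $f$ in $R$ counting multiplicity is $(2\pi i)^{-1} \oint_C [f'(z)/f(z)]\, dz$
provided $f$ has no zeros on $C$. Now suppose $g$ is another function that is also
analytic inside and on $C$ where $|g'(z)/g(z) - f'(z)/f(z)| < 2\pi/\mathrm{length}(C)$ for all $z$ on
$C$. Since the value of $(2\pi i)^{-1} \oint_C [g'(z)/g(z)]\, dz$ must also be an integer and differs
from $(2\pi i)^{-1} \oint_C [f'(z)/f(z)]\, dz$ by less than one, the two integrals must have the same
value. Thus $f$ and $g$ have the same number of zeros counting multiplicity in $R$.

Note that if $f(z) = \det(zI - A)$ then $f'(z) = \mathrm{trace}\big[(zI - A)^{-1}\big] \det(zI - A)$ and so
$f'(z)/f(z) = \mathrm{trace}\big[(zI - A)^{-1}\big]$. Also, by the formula for the induced $\infty$-norm,
$|\mathrm{trace}[B]| \leq n\, \|B\|_\infty$.

If $\lambda \in \sigma(A)$, let $C_{\lambda,\epsilon}$ be the circle of radius $\epsilon$ centered at $\lambda$ going once
counter-clockwise around $\lambda$. Without loss of generality, suppose that $\epsilon$ is less than half the
distance from $\lambda$ to the nearest different eigenvalue of $A$. Choose

\[ \delta = \frac{1}{2 \max\big\{\, \max\big( n\epsilon\, \|(zI - A)^{-1}\|_\infty^2,\ \|(zI - A)^{-1}\|_\infty \big) \;\big|\; z \in C_{\lambda,\epsilon} \,\big\}}. \]

Then provided $\|E\|_\infty < \delta$ we have, for $z \in C_{\lambda,\epsilon}$,

\[
\begin{aligned}
\big| \mathrm{trace}\big[(zI - A - E)^{-1}\big] - \mathrm{trace}\big[(zI - A)^{-1}\big] \big|
&\leq n\, \big\| (zI - A - E)^{-1} - (zI - A)^{-1} \big\|_\infty \\
&\leq n\, \|E\|_\infty\, \big\| (zI - A - E)^{-1} \big\|_\infty\, \big\| (zI - A)^{-1} \big\|_\infty \\
&\leq n\, \|E\|_\infty\, \frac{\|(zI - A)^{-1}\|_\infty^2}{1 - \|E\|_\infty \|(zI - A)^{-1}\|_\infty} \\
&\leq 2n\, \|E\|_\infty\, \|(zI - A)^{-1}\|_\infty^2 < \frac{1}{\epsilon} = \frac{2\pi}{\mathrm{length}(C_{\lambda,\epsilon})}.
\end{aligned}
\]

Thus the number of eigenvalues of $A + E$ inside $C_{\lambda,\epsilon}$ is the same as the number of
eigenvalues of $A$ inside $C_{\lambda,\epsilon}$, counting multiplicity. So every eigenvalue of $A$ must be
within $\epsilon$ of some eigenvalue of $A + E$. Since the number of eigenvalues of $A + E$
counting multiplicity is also $n$, every eigenvalue of $A + E$ must be within $\epsilon$ of some
eigenvalue of $A$.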


Perhaps the most celebrated eigenvalue “perturbation” theorem is the Gershgorin 
theorem for perturbations of a diagonal matrix: 


Theorem 2.25 (Gershgorin's theorem) If $A$ is a complex $n \times n$ matrix, then every
eigenvalue of $A$ is in the union of disks

\[ D_j = \Big\{\, z \in \mathbb{C} \;\Big|\; |z - a_{jj}| \leq \sum_{k : k \neq j} |a_{jk}| \,\Big\}, \qquad j = 1, 2, \ldots, n, \]

in the complex plane. Furthermore, if a subset of $m$ of these disks do not intersect any
other of these disks, then the union of these $m$ disks contains exactly $m$ eigenvalues
counting multiplicities.


Proof For the first part of the theorem, suppose $Av = \lambda v$ with $v \neq 0$. Let $\ell$ be an
index where $|v_\ell| = \max_j |v_j| = \|v\|_\infty$. Then

\[ |(a_{\ell\ell} - \lambda)\, v_\ell| = \Big| \sum_{k : k \neq \ell} a_{\ell k} v_k \Big| \leq \sum_{k : k \neq \ell} |a_{\ell k}|\, |v_k| \leq \Big( \sum_{k : k \neq \ell} |a_{\ell k}| \Big) |v_\ell|. \]

Dividing by $|v_\ell|$ gives the bound $|a_{\ell\ell} - \lambda| \leq \sum_{k : k \neq \ell} |a_{\ell k}|$. Thus $\lambda \in D_\ell$. Repeating
this for every eigenvalue shows that every eigenvalue of $A$ is in the union $\bigcup_{j=1}^n D_j$.

For the second part of the theorem, let $A(r)$ be the matrix given by $a_{jj}(r) = a_{jj}$,
while if $j \neq k$ we set $a_{jk}(r) = r\, a_{jk}$. Clearly $A(0)$ is the diagonal part of $A$ while
$A(1) = A$ is the original matrix. Let $D_j(r) = \big\{\, z \in \mathbb{C} \mid |z - a_{jj}| \leq r \sum_{k : k \neq j} |a_{jk}| \,\big\}$,
which are the Gershgorin disks for $A(r)$. If $r < r'$ then $D_j(r) \subseteq D_j(r')$. Suppose
$\{D_{j_1}, D_{j_2}, \ldots, D_{j_m}\}$ is a collection of Gershgorin disks that do not intersect any
other Gershgorin disk of $A$. Then $\{D_{j_1}(r), D_{j_2}(r), \ldots, D_{j_m}(r)\}$ is a collection of
Gershgorin disks that do not intersect any other Gershgorin disk of $A(r)$. If
$0 < \epsilon < \min_{k \neq \ell} |a_{kk} - a_{\ell\ell}|$, then the number of eigenvalues of $A(\epsilon)$ in
$D_{j_1}(\epsilon) \cup D_{j_2}(\epsilon) \cup \cdots \cup D_{j_m}(\epsilon)$ is exactly $m$. Because of the continuity of the spectrum (Theorem 2.24),
$D_{j_1}(r) \cup D_{j_2}(r) \cup \cdots \cup D_{j_m}(r)$ has exactly $m$ eigenvalues counting multiplicities for
all $0 \leq r \leq 1$. Setting $r = 1$ gives the conclusion we wish.
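Gershgorin disks are cheap to compute, since they need only the diagonal entries and the absolute off-diagonal row sums. The following sketch uses hypothetical function names and two small example matrices whose eigenvalues are known in closed form, so that both parts of the theorem can be checked.

```python
import math

def gershgorin_disks(A):
    """Return [(center, radius)] with center a_jj and radius sum_{k != j} |a_jk|."""
    return [(row[j], sum(abs(v) for k, v in enumerate(row) if k != j))
            for j, row in enumerate(A)]

def in_union(z, disks):
    """True if z lies in at least one of the disks."""
    return any(abs(z - c) <= r for c, r in disks)

# Symmetric example with eigenvalues (5 -/+ sqrt(5)) / 2.
A = [[2.0, 1.0], [1.0, 3.0]]
disks = gershgorin_disks(A)          # [(2.0, 1.0), (3.0, 1.0)]
eigs = [(5 - math.sqrt(5)) / 2, (5 + math.sqrt(5)) / 2]
print(all(in_union(e, disks) for e in eigs))

# Triangular example: eigenvalues 10 and 1, with two disjoint disks.
B = [[10.0, 0.5], [0.0, 1.0]]
print(gershgorin_disks(B))           # [(10.0, 0.5), (1.0, 0.0)]
```

In the second example the disks are disjoint, so each contains exactly one eigenvalue, which is the counting statement in the second part of the theorem.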


A related perturbation result is also straightforward to prove: 


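Theorem 2.26 (Bauer–Fike perturbation theorem) If $A$ is an $n \times n$ complex matrix
and $X^{-1} A X = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ then for any $p \geq 1$ and $\mu \in \sigma(A + E)$ we
have the bound

\[ \min_{j} |\lambda_j - \mu| \leq \kappa_p(X)\, \|E\|_p. \]

This bounds the perturbation of the eigenvalues in terms of the
condition number of a matrix of eigenvectors $X$.


Proof First note that $\sigma(A) = \{\lambda_1, \lambda_2, \ldots, \lambda_n\}$. Let $D = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$
and $F = X^{-1} E X$. If $\mu \in \sigma(A)$ the bound is trivial, so suppose that $D - \mu I$ is invertible.
Since $A + E - \mu I$ is not invertible, neither is $X^{-1}(A + E - \mu I)X = D + F - \mu I
= (D - \mu I)\big(I + (D - \mu I)^{-1} F\big)$. Therefore,

\[
\begin{aligned}
1 \leq \big\| (D - \mu I)^{-1} F \big\|_p &\leq \big\| (D - \mu I)^{-1} \big\|_p\, \|F\|_p \\
&\leq \big\| (D - \mu I)^{-1} \big\|_p\, \|X^{-1}\|_p\, \|X\|_p\, \|E\|_p \\
&= \frac{1}{\min_{\lambda \in \sigma(A)} |\lambda - \mu|}\, \kappa_p(X)\, \|E\|_p.
\end{aligned}
\]

Re-arranging gives $\min_{\lambda \in \sigma(A)} |\lambda - \mu| \leq \kappa_p(X)\, \|E\|_p$.

A perturbation theorem for eigenvalues of symmetric or complex Hermitian matrices
is the Wielandt–Hoffman theorem, which is the subject of Exercise 3.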
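The Bauer–Fike bound can be checked numerically on the nearly defective matrix $A_\eta$ used earlier to illustrate the Jordan canonical form. The sketch below is an illustration only: the perturbation `E`, the values of `eta` and `delta`, and the helper names are assumptions, and the eigenvalues of the $2 \times 2$ perturbed matrix are taken from the quadratic formula.

```python
import cmath

def inf_norm(M):
    """Induced infinity-norm: the largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)

def eig_2x2(M):
    """Eigenvalues of a 2 x 2 matrix from its characteristic polynomial."""
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    root = cmath.sqrt(tr * tr - 4 * det)
    return (tr - root) / 2, (tr + root) / 2

eta, delta = 1e-3, 1e-6
A = [[1.0, 1.0], [0.0, 1.0 + eta]]           # eigenvalues 1 and 1 + eta
X = [[1.0, 1.0], [0.0, eta]]                  # columns are eigenvectors of A
Xinv = [[1.0, -1.0 / eta], [0.0, 1.0 / eta]]
E = [[0.0, 0.0], [delta, 0.0]]                # hypothetical perturbation
AE = [[A[i][j] + E[i][j] for j in range(2)] for i in range(2)]

bound = inf_norm(X) * inf_norm(Xinv) * inf_norm(E)    # kappa_inf(X) * ||E||_inf
shifts = [min(abs(mu - 1.0), abs(mu - (1.0 + eta))) for mu in eig_2x2(AE)]
print(shifts, bound)
```

Here a perturbation of size $10^{-6}$ moves the eigenvalues by several hundred times that amount, which the bound permits because $\kappa_\infty(X)$ is about $2 \times 10^3$.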

Eigenvectors can change discontinuously for repeated eigenvalues, even for sym- 
metric and complex Hermitian matrices, so any perturbation theorem must take this 
into account. 


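Theorem 2.27 Suppose $A$ ($n \times n$) has a simple eigenvalue $\lambda$ with normalized
eigenvector $v$ and $A^T$ has a corresponding eigenvector $z$ ($A^T z = \lambda z$). Then $A + E$ has
an eigenvector $w$ where

\[ \|w - v\|_2 \leq \left\| \begin{bmatrix} A - \lambda I \\ v^T \end{bmatrix}^{+} \right\|_2 \left( \frac{1}{|\cos\theta|} + 1 \right) \|E\|_2 + O(\|E\|_2^2) \quad\text{as } \|E\|_2 \to 0. \]

Note that $\theta$ is the angle between $v$ and $z$.


Proof Let $A(s) = A + s\widehat{E}$ with $\widehat{E} = E / \|E\|_2$, and consider the smooth equations
$A(s)\, v(s) = \lambda(s)\, v(s)$ and $\|v(s)\|_2^2 = 1$. Under our assumptions these have locally unique
and smooth solutions. Differentiating these equations with respect to $s$ and
substituting $s = 0$ gives

\[ \widehat{E}\, v + A\, u = \mu\, v + \lambda\, u \quad\text{and}\quad v^T u = 0, \]

where $u = dv/ds(0)$ and $\mu = d\lambda/ds(0)$. Re-arranging the first of these equations
gives $(A - \lambda I)u = (\mu I - \widehat{E})v$. Pre-multiplying by $z^T$ gives $0 = z^T(\mu I - \widehat{E})v$.
Solving for $\mu$ gives $\mu = z^T \widehat{E} v / z^T v$. Therefore, $(A - \lambda I)u = [v z^T/(z^T v) - I]\,\widehat{E} v$.
Note that the right-hand side is orthogonal to $z$ since $z^T[v z^T/(z^T v) - I] =
(z^T v) z^T/(z^T v) - z^T = 0$. The null space of $A^T - \lambda I$ is $\mathrm{span}\{z\}$ and $v \neq 0$, so the
null space of $\begin{bmatrix} A - \lambda I \\ v^T \end{bmatrix}^T$ is $\mathrm{span}\left\{ \begin{bmatrix} z \\ 0 \end{bmatrix} \right\}$. By Theorem 8.6, the range of $\begin{bmatrix} A - \lambda I \\ v^T \end{bmatrix}$
is the orthogonal complement of $\mathrm{span}\left\{ \begin{bmatrix} z \\ 0 \end{bmatrix} \right\}$. Therefore,

\[ \begin{bmatrix} A - \lambda I \\ v^T \end{bmatrix} u = \begin{bmatrix} \big[ v z^T/(z^T v) - I \big] \widehat{E} v \\ 0 \end{bmatrix} \]

has a solution $u$, which is given by

\[ u = \begin{bmatrix} A - \lambda I \\ v^T \end{bmatrix}^{+} \begin{bmatrix} \big[ v z^T/(z^T v) - I \big] \widehat{E} v \\ 0 \end{bmatrix}. \]

The 2-norm of $u$ is bounded by

\[ \|u\|_2 \leq \left\| \begin{bmatrix} A - \lambda I \\ v^T \end{bmatrix}^{+} \right\|_2 \left( \frac{\|v z^T\|_2}{|z^T v|} + 1 \right) \|\widehat{E}\|_2\, \|v\|_2. \]

Note that $\|\widehat{E}\|_2 = 1$, $\|v\|_2 = 1$, and $\|v z^T\|_2 / |z^T v| = \|v\|_2 \|z\|_2 / (|\cos\theta|\, \|v\|_2 \|z\|_2) =
1/|\cos\theta|$. We then obtain the desired bounds with $w = v + s\,u + O(s^2)$ using $s =
\|E\|_2$.


Note that if $A$ is symmetric then we can take $z = v$, and note that
$\|v v^T/(v^T v) - I\|_2 \leq 1$, to obtain the improved bounds

\[ \|w - v\|_2 \leq \frac{\|E\|_2}{\min_{\mu \in \sigma(A),\ \mu \neq \lambda} |\mu - \lambda|} + O(\|E\|_2^2). \]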

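Even for real symmetric matrices, the eigenvectors of nearly repeated eigenvalues
are not stable. Consider the perturbation of

\[ A_\epsilon = \begin{bmatrix} 1 & 0 \\ 0 & 1 + \epsilon \end{bmatrix} \quad\text{to}\quad A'_\epsilon = \begin{bmatrix} 1 & \epsilon \\ \epsilon & 1 + \epsilon \end{bmatrix}. \]

This perturbation has size $O(\epsilon)$, which is the same order as the separation of the
eigenvalues. The eigenvalues of $A'_\epsilon$ are

\[ \lambda_\pm = 1 + \epsilon\, \frac{1 \pm \sqrt{5}}{2}, \]

while for an eigenvector $v = [v_1, v_2]^T$ of $A'_\epsilon$ with eigenvalue $\lambda_-$,

\[ v_2 / v_1 = \frac{1 - \sqrt{5}}{2}, \]

while for any eigenvector $v = [v_1, v_2]^T$ of $A_\epsilon$ with eigenvalue $1$, $v_2/v_1 = 0$. In these
cases, if $A = A_\epsilon$ and $E = A'_\epsilon - A_\epsilon$, we have both $\min_{\mu \in \sigma(A),\ \mu \neq \lambda} |\lambda - \mu| = \epsilon$ and
$\|E\|_2 = \epsilon$.

Algorithm 35 Orthogonal iteration

function orthogiteration($A$, $U^{(0)}$, $m$)
    for $k = 0, 1, 2, \ldots, m - 1$
        $V^{(k)} \leftarrow A\, U^{(k)}$
        $V^{(k)} = U^{(k+1)} R^{(k+1)}$   (QR factorization)
    end for
    return $U^{(m)}$
end function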

2.5.3. The QR Algorithm 


While the power method and the related methods in Section 2.5.1 can be effec- 
tive at finding a single eigenvalue and its eigenvector, when we want to find a 
complete eigen-decomposition, we need something better. That better algorithm is 
widely understood to be the QR algorithm, developed independently in 1961 by John 
G. Francis and by Vera Kublanovskaya [94, 148]. 


2.5.3.1 Orthogonal Iteration 


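The starting point for these algorithms is orthogonal iteration: For an $n \times n$ matrix $A$,
start with an $n \times n$ orthogonal matrix $U^{(0)}$. The algorithm is shown in Algorithm 35.
Note that $A\, U^{(j)} = U^{(j+1)} R^{(j+1)}$, so

\[
\begin{aligned}
A^k U^{(0)} = A^{k-1} A\, U^{(0)} &= A^{k-1} U^{(1)} R^{(1)} \\
&= A^{k-2} U^{(2)} R^{(2)} R^{(1)} \\
&\;\;\vdots \\
&= U^{(k)} R^{(k)} \cdots R^{(2)} R^{(1)}.
\end{aligned}
\]

Since the product of upper triangular matrices is upper triangular we can write
$A^k U^{(0)} = U^{(k)} \widehat{R}^{(k)}$ where $\widehat{R}^{(k)} = R^{(k)} \cdots R^{(2)} R^{(1)}$. Setting $U^{(j)} = [u_1^{(j)}, \widetilde{U}^{(j)}]$ and
partitioning $R^{(j)}$ consistently gives

\[ U^{(j)} R^{(j)} = \big[ u_1^{(j)},\ \widetilde{U}^{(j)} \big] \begin{bmatrix} \rho^{(j)} & (r^{(j)})^T \\ 0 & \widetilde{R}^{(j)} \end{bmatrix} = \big[ \rho^{(j)} u_1^{(j)},\ \ \widetilde{U}^{(j)} \widetilde{R}^{(j)} + u_1^{(j)} (r^{(j)})^T \big], \]
\[ A\, U^{(j-1)} = A\, \big[ u_1^{(j-1)},\ \widetilde{U}^{(j-1)} \big] = \big[ A u_1^{(j-1)},\ A \widetilde{U}^{(j-1)} \big], \]

so the first column of $U^{(j)}$ satisfies $\rho^{(j)} u_1^{(j)} = A u_1^{(j-1)}$. So the first column of $U^{(j)}$ acts as if
the power method is being applied to it. Since $U^{(j)}$ is orthogonal its inverse is equal
to its transpose. Note that

\[ (RS)^{-T} = \big( (RS)^{-1} \big)^T = \big( S^{-1} R^{-1} \big)^T = R^{-T} S^{-T}. \]

So

\[ A^{-T} U^{(j-1)} = U^{(j)} (R^{(j)})^{-T}. \]

But here, $(R^{(j)})^{-T}$ is lower triangular. Writing $U^{(j)} = [\widetilde{U}^{(j)}, u_n^{(j)}]$ and partitioning
$(R^{(j)})^{-T} = L^{(j)}$ consistently we get

\[ \big[ \widetilde{U}^{(j)},\ u_n^{(j)} \big] \begin{bmatrix} \widetilde{L}^{(j)} & 0 \\ (\ell^{(j)})^T & \ell_{nn}^{(j)} \end{bmatrix} = \big[ \widetilde{U}^{(j)} \widetilde{L}^{(j)} + u_n^{(j)} (\ell^{(j)})^T,\ \ \ell_{nn}^{(j)} u_n^{(j)} \big] = \big[ A^{-T} \widetilde{U}^{(j-1)},\ A^{-T} u_n^{(j-1)} \big]. \]

This means that the last column $u_n^{(j)}$ of $U^{(j)}$ essentially has the power method for $A^{-T}$
applied to it. So $u_n^{(j)}$ should converge to an eigenvector for the smallest eigenvalue
of $A^T$.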

After the first column of $U^{(j)}$ has converged to $\pm v_1$, with $v_1$ the unit eigenvector
for the dominant eigenvalue of $A$, the second column of $U^{(j)}$ can be thought of as
undergoing the iteration $u_2^{(j+1)} = \pm P_1 A u_2^{(j)} / \| P_1 A u_2^{(j)} \|_2$ where $P_1$ is the orthogonal
projection $P_1 = I - v_1 v_1^T$. This again works like the power method using the matrix
$P_1 A$. Assuming the power method for this matrix converges, the limit is an eigenvector
$v_2$ of $P_1 A$: $P_1 A v_2 = \lambda_2 v_2$. Then $A v_2 - v_1 v_1^T A v_2 = \lambda_2 v_2$ and so $A v_2 = \mu_{1,2} v_1 +
\lambda_2 v_2$. That is,

\[ A\, [v_1, v_2] = [v_1, v_2] \begin{bmatrix} \lambda_1 & \mu_{1,2} \\ & \lambda_2 \end{bmatrix}. \]

After convergence of the first two columns of $U^{(j)}$, the third column essentially
undergoes the iteration $u_3^{(j+1)} = \pm P_2 A u_3^{(j)} / \| P_2 A u_3^{(j)} \|_2$ where $P_2 = I -
[v_1, v_2][v_1, v_2]^T$, noting that $v_2$ is orthogonal to $v_1$. The limiting eigenvector
(assuming convergence) is then $v_3$ where $P_2 A v_3 = \lambda_3 v_3$, which means that $A v_3 =
\mu_{1,3} v_1 + \mu_{2,3} v_2 + \lambda_3 v_3$, that is,

\[ A\, [v_1, v_2, v_3] = [v_1, v_2, v_3] \begin{bmatrix} \lambda_1 & \mu_{1,2} & \mu_{1,3} \\ & \lambda_2 & \mu_{2,3} \\ & & \lambda_3 \end{bmatrix}. \]

Continuing in this way we can justify the claim that each column $u_\ell^{(j)}$ approaches
$\mathrm{span}\{v_\ell\}$ as $j \to \infty$ where the $v_\ell$'s are orthonormal and

\[ A\, [v_1, v_2, \ldots, v_n] = [v_1, v_2, \ldots, v_n]\, T \quad\text{with } T \text{ upper triangular,} \]

at least provided $|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n|$. We should be careful about this line of
reasoning as it gives the impression that the convergence is sequential, first for the
first column, then for the second column, and so on, when in fact it is simultaneous.

Fig. 2.5.3 Norms of sub-diagonal columns for orthogonal iteration

Table 2.5.1 Eigenvalues of A from (2.5.3)

  k    Re $\lambda_k$     Im $\lambda_k$      $|\lambda_k|$
  1    +3.26537768         0                  3.2653777
  2    +0.90230447        +2.468506906        2.6282465
  3    +0.90230447        -2.468506906        2.6282465
  4    -1.03499331        +2.371325340        2.5873529
  5    -1.03499331        -2.371325340        2.5873529

The quantities that are most important are the ratios $|\lambda_{k+1}| / |\lambda_k|$. The smaller these
are, the faster the convergence.

For the example matrix in (2.5.3), the norms of the sub-diagonal columns of
$(U^{(j)})^T A\, U^{(j)}$ are shown in Figure 2.5.3. To understand why the method behaves this
way, we need to know the eigenvalues and their magnitudes, which are shown in
Table 2.5.1.

Note that the eigenvalue magnitudes are $|\lambda_1| > |\lambda_2| = |\lambda_3| > |\lambda_4| = |\lambda_5|$. So we
do not expect $b_{32}^{(j)}$ or $b_{54}^{(j)}$ to go to zero where $B^{(j)} = (U^{(j)})^T A\, U^{(j)}$. Furthermore,
the rate at which $b_{21}^{(j)} \to 0$ as $j \to \infty$ is controlled by $|\lambda_2| / |\lambda_1| \approx 0.805$, while the
rate at which $b_{43}^{(j)} \to 0$ as $j \to \infty$ is controlled by $|\lambda_4| / |\lambda_3| \approx 0.9844$. This clearly
indicates much faster convergence for $b_{21}^{(j)}$ (and other entries below the $(2, 1)$ entry)
than for $b_{43}^{(j)}$.
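Orthogonal iteration (Algorithm 35) can be sketched in plain Python for small real matrices. This is an illustration under assumptions: the QR factorization is done by modified Gram-Schmidt (adequate for well-conditioned examples, and it produces a positive diagonal in $R$), the matrix must be invertible, and the symmetric $2 \times 2$ test matrix with eigenvalues $(5 \pm \sqrt{5})/2$ is a hypothetical example, not the matrix (2.5.3).

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def qr_gram_schmidt(A):
    """QR factorization of a real invertible matrix by modified Gram-Schmidt;
    the diagonal of R is positive."""
    n = len(A)
    cols = transpose(A)
    q_cols, R = [], [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for i, q in enumerate(q_cols):
            R[i][j] = sum(a * b for a, b in zip(q, v))
            v = [b - R[i][j] * a for a, b in zip(q, v)]
        R[j][j] = math.sqrt(sum(b * b for b in v))
        q_cols.append([b / R[j][j] for b in v])
    return transpose(q_cols), R

def orthogiteration(A, U0, m):
    U = U0
    for _ in range(m):
        V = matmul(A, U)                 # V^(k) = A U^(k)
        U, _R = qr_gram_schmidt(V)       # V^(k) = U^(k+1) R^(k+1)
    return U

A = [[2.0, 1.0], [1.0, 3.0]]
U = orthogiteration(A, [[1.0, 0.0], [0.0, 1.0]], 40)
B = matmul(transpose(U), matmul(A, U))   # U^T A U, nearly upper triangular
print(B)
```

For this example the sub-diagonal entry of $U^T A U$ decays like $(\lambda_2/\lambda_1)^j \approx 0.382^j$, so forty iterations reduce it to roundoff level and the diagonal holds the eigenvalues in decreasing order.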

Justification for an asymptotic convergence rate of $|\lambda_{k+1}| / |\lambda_k|$ in the entries can
be found, for example, in [105, Sec. 7.3.2, pp. 332-334].

Algorithm 36 The basic QR algorithm

function QRalgorithm($A_0$, $m$)
    $U_0 \leftarrow I$
    for $j = 0, 1, \ldots, m - 1$
        $A_j = Q_j R_j$   // QR factorization
        $A_{j+1} \leftarrow R_j Q_j$; $U_{j+1} \leftarrow U_j Q_j$
    end for
    return $(A_m, U_m)$
end function


2.5.3.2 The Basic QR Algorithm 


The basic QR algorithm is shown in Algorithm 36. 

At first it can seem that the QR algorithm has nothing to do with finding
eigenvalues and eigenvectors. But the connections become apparent with some
analysis. Since $A_j = Q_j R_j$ from the QR factorization, we have $R_j = \overline{Q_j}^T A_j$ so
$A_{j+1} = R_j Q_j = \overline{Q_j}^T A_j Q_j$ so that $A_{j+1}$ is unitarily similar to $A_j$. Thus $A_{j+1}$ and
$A_j$ have the same eigenvalues with the same multiplicities. If $\overline{U_j}^T A_0 U_j = A_j$ then

\[ A_{j+1} = \overline{Q_j}^T A_j Q_j = \overline{Q_j}^T\, \overline{U_j}^T A_0\, U_j Q_j = \overline{U_{j+1}}^T A_0\, U_{j+1}. \]

Since $U_0 = I$, mathematical induction lets us conclude that $\overline{U_j}^T A_0 U_j = A_j$ for $j =
0, 1, 2, \ldots$. This is closely related to orthogonal iteration: $A\, U^{(j)} = U^{(j+1)} R^{(j+1)}$
(QR factorization). In fact, we will now show that $U^{(j)}$ from orthogonal iteration is,
under a suitable restriction on the QR factorization used, identical to the $U_j$ matrix
obtained from the QR algorithm for $j = 0, 1, 2, \ldots$.


Lemma 2.28 Suppose the QR factorization algorithm is chosen so that the $R$ matrix
has only non-negative diagonal entries. Then provided $A = A_0$ is invertible, we have $U^{(j)} =
U_j$ for $j = 0, 1, 2, \ldots$ where $U^{(j)}$ is the matrix obtained from orthogonal iteration
in Algorithm 35 with $U^{(0)} = I$, and $U_j$ the matrix obtained from the QR algorithm
in Algorithm 36.


Proof First note that $U^{(0)} = U_0 = I$. This is the base case for our induction proof.
Suppose that $U^{(j)} = U_j$ and $A = A_0$. Then $A_j = Q_j R_j$. On the other hand,

\[ A_j = \overline{U_j}^T A\, U_j = \overline{U^{(j)}}^T A\, U^{(j)} = \overline{U^{(j)}}^T U^{(j+1)} R^{(j+1)}, \quad\text{so} \]
\[ Q_j R_j = \overline{U^{(j)}}^T U^{(j+1)} R^{(j+1)}. \]

By uniqueness of QR factorizations for invertible matrices, there is a diagonal matrix
$D_j$ with diagonal entries $e^{i\theta_\ell}$, where $\overline{U^{(j)}}^T U^{(j+1)} = Q_j D_j$ and $\overline{D_j} R_j = R^{(j+1)}$. To
simplify the analysis, we can assume that the QR factorization is implemented so
that the diagonal entries of the “R” matrix of the QR factorization are positive. Then
$D_j = I$. Also, $\overline{U^{(j)}}^T U^{(j+1)} = Q_j$ and $R_j = R^{(j+1)}$. The first equation means that
$U^{(j+1)} = U^{(j)} Q_j = U_j Q_j = U_{j+1}$. This completes the induction step.
By mathematical induction, $U^{(j)} = U_j$ and $R^{(j+1)} = R_j$ for $j = 0, 1, 2, \ldots$.


Lemma 2.28 also implies that the convergence rate of the basic QR algorithm is the 
convergence rate of orthogonal iteration. As we have seen, this can be very slow. 
If we view orthogonal iteration as a generalization of the power method, then we 
can see that shifting the eigenvalues can be a very effective strategy. It can also be 
applied to the QR algorithm. We will see how in the next section. 
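The basic QR algorithm (Algorithm 36) and the identity $U^{(j)} = U_j$ of Lemma 2.28 can be checked on a small example. The sketch below rests on assumptions: real invertible matrices stored as lists of rows, a modified Gram-Schmidt QR factorization with positive diagonal in $R$, and a hypothetical symmetric test matrix with eigenvalues $(5 \pm \sqrt{5})/2$.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def qr_gram_schmidt(A):
    """QR of a real invertible matrix (modified Gram-Schmidt, diag(R) > 0)."""
    n = len(A)
    cols = transpose(A)
    q_cols, R = [], [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for i, q in enumerate(q_cols):
            R[i][j] = sum(a * b for a, b in zip(q, v))
            v = [b - R[i][j] * a for a, b in zip(q, v)]
        R[j][j] = math.sqrt(sum(b * b for b in v))
        q_cols.append([b / R[j][j] for b in v])
    return transpose(q_cols), R

def identity(n):
    return [[float(i == j) for j in range(n)] for i in range(n)]

def qr_algorithm(A0, m):
    """Basic QR algorithm: A_{j+1} = R_j Q_j, U_{j+1} = U_j Q_j."""
    A, U = [row[:] for row in A0], identity(len(A0))
    for _ in range(m):
        Q, R = qr_gram_schmidt(A)
        A = matmul(R, Q)
        U = matmul(U, Q)
    return A, U

def orthogiteration(A, m):
    """Orthogonal iteration started from U^(0) = I, for comparison."""
    U = identity(len(A))
    for _ in range(m):
        U, _R = qr_gram_schmidt(matmul(A, U))
    return U

A0 = [[2.0, 1.0], [1.0, 3.0]]
Am, Um = qr_algorithm(A0, 40)
print(Am)   # nearly diagonal, with the eigenvalues of A0 on the diagonal
```

Comparing `Um` with `orthogiteration(A0, 40)` shows agreement up to roundoff, as the lemma predicts, and `Am` agrees with $U_m^T A_0 U_m$.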


2.5.3.3 Shifting 


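Shifting can be applied to orthogonal iteration and the QR algorithm by replacing $A$
with $A - \mu I$ and $A_j$ with $A_j - \mu I$ for computing the QR factorization. The shifted
orthogonal iteration is $(A - \mu I)\, U^{(j)} = U^{(j+1)} R^{(j+1)}$ and the shifted QR algorithm
step is $A_j - \mu I = Q_j R_j$ and $A_{j+1} \leftarrow R_j Q_j + \mu I$. The matrix $\mu I$ must be added
back in to restore $A_{j+1}$ to be similar to $A_j$:

\[ R_j = \overline{Q_j}^T (A_j - \mu I) = \overline{Q_j}^T A_j - \mu\, \overline{Q_j}^T, \quad\text{and so} \]
\[ A_{j+1} = R_j Q_j + \mu I = \big( \overline{Q_j}^T A_j - \mu\, \overline{Q_j}^T \big) Q_j + \mu I = \overline{Q_j}^T A_j Q_j - \mu I + \mu I = \overline{Q_j}^T A_j Q_j. \]

As with the basic QR algorithm, we suppose that $U_j = U^{(j)}$. We want to show that
$U_{j+1} = U^{(j+1)}$. Since $A_j - \mu I = Q_j R_j$,

\[ A_j - \mu I = \overline{U_j}^T (A - \mu I)\, U_j = \overline{U^{(j)}}^T (A - \mu I)\, U^{(j)} = \overline{U^{(j)}}^T U^{(j+1)} R^{(j+1)}, \quad\text{so} \]
\[ Q_j R_j = \overline{U^{(j)}}^T U^{(j+1)} R^{(j+1)}. \]

Again assuming that the “R” in the QR factorization has positive diagonal entries,
we have $R_j = R^{(j+1)}$ and $U^{(j+1)} = U^{(j)} Q_j = U_j Q_j = U_{j+1}$, as we wanted.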
The real advantage of shifting is not what happens to the first column of $U^{(j)}$, where $u_1^{(j+1)} = (A - \mu I)u_1^{(j)} / \|(A - \mu I)u_1^{(j)}\|_2$. Rather, it is the last column, where

$$u_n^{(j+1)} = (\overline{A}^T - \overline{\mu} I)^{-1} u_n^{(j)} \,\big/\, \|(\overline{A}^T - \overline{\mu} I)^{-1} u_n^{(j)}\|_2,$$

that features the fastest convergence: if $\mu$ is close to an eigenvalue of $A$ then the component in the direction of the corresponding eigenvector is greatly amplified. Estimates of an eigenvalue can be obtained from $A_j$ by using the bottom-right entry $(A_j)_{nn} = \overline{u_n^{(j)}}^T A\, u_n^{(j)}$. This gives superlinear convergence of the eigenvalue estimate, as for inverse iteration (Algorithm 33). If the matrix $A$ is real, this method will still give only real shifts, and will not make explicit any complex eigenvalues. A better strategy is to use one of the eigenvalues of the bottom-right $2 \times 2$ submatrix

$$\begin{bmatrix} (A_j)_{n-1,n-1} & (A_j)_{n-1,n} \\ (A_j)_{n,n-1} & (A_j)_{nn} \end{bmatrix}.$$

When the ratio between the largest and smallest eigenvalues $|\lambda_1| / |\lambda_n|$ is large, then it is important to use small shifts. Zero shifts may be desirable initially so that the bottom-right entries are small before we use them to compute shifts. A pseudo-code for a QR algorithm with shifting is shown in Algorithm 37. Results for this algorithm applied to our test matrix (2.5.3) are shown in Figure 2.5.4, showing the convergence of selected sub-diagonal entries.

Algorithm 37 The QR algorithm with shifts
    function QRshift($A_0$, $m$)
        $U_0 \leftarrow I$; $\mu \leftarrow 0$
        for $j = 0, 1, \ldots, m-1$
            $A_j - \mu I = Q_j R_j$  // QR factorization
            $A_{j+1} \leftarrow R_j Q_j + \mu I$; $U_{j+1} \leftarrow U_j Q_j$
            $\mu \leftarrow$ smallest eigenvalue of $\begin{bmatrix} (A_j)_{n-1,n-1} & (A_j)_{n-1,n} \\ (A_j)_{n,n-1} & (A_j)_{nn} \end{bmatrix}$
        end for
        return $(A_m, U_m)$
    end function
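The shifted iteration can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's implementation: the input is taken to be a small real symmetric matrix (so the $2 \times 2$ shift is real), matrices are plain lists of rows, the QR factorization is done with Givens rotations, and the shift is recomputed from the newly formed matrix. The helper names `matmul`, `qr_givens`, `smallest_eig_2x2` and `qr_shift` are hypothetical.

```python
import math

def matmul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def qr_givens(A):
    """Return (Q, R) with A = Q R, Q orthogonal, R upper triangular."""
    n = len(A)
    R = [row[:] for row in A]
    Q = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        for i in range(n - 1, k, -1):      # zero R[i][k] using R[i-1][k]
            a, b = R[i - 1][k], R[i][k]
            r = math.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r
            for j in range(n):             # rotate rows i-1 and i of R
                x, y = R[i - 1][j], R[i][j]
                R[i - 1][j], R[i][j] = c * x + s * y, -s * x + c * y
            for j in range(n):             # rotate columns i-1 and i of Q
                x, y = Q[j][i - 1], Q[j][i]
                Q[j][i - 1], Q[j][i] = c * x + s * y, -s * x + c * y
    return Q, R

def smallest_eig_2x2(p, b, c, q):
    """Smallest eigenvalue of [[p, b], [c, q]], assuming both are real."""
    mean = (p + q) / 2.0
    disc = max(((p - q) / 2.0) ** 2 + b * c, 0.0)
    return mean - math.sqrt(disc)

def qr_shift(A0, m):
    """m steps of the shifted QR iteration; returns (A_m, U_m)."""
    n = len(A0)
    A = [row[:] for row in A0]
    U = [[float(i == j) for j in range(n)] for i in range(n)]
    mu = 0.0                               # first step is unshifted
    for _ in range(m):
        shifted = [[A[i][j] - (mu if i == j else 0.0) for j in range(n)]
                   for i in range(n)]
        Q, R = qr_givens(shifted)
        A = matmul(R, Q)
        for i in range(n):
            A[i][i] += mu                  # add the shift back
        U = matmul(U, Q)
        mu = smallest_eig_2x2(A[n - 2][n - 2], A[n - 2][n - 1],
                              A[n - 1][n - 2], A[n - 1][n - 1])
    return A, U
```

For the $3 \times 3$ matrix with 2 on the diagonal and $-1$ on the off-diagonals, calling `qr_shift(A0, 20)` should drive the last sub-diagonal entry to roundoff level while the trace stays at 6.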

As can be seen in Figure 2.5.4, after some initial “dithering”, the $a_{54}$ entry goes to zero quite rapidly. If we continued the graph, then the size of $a_{54}$ goes well below unit roundoff; in fact, in 20 iterations its value is $\approx 2.16 \times 10^{-227}$. The $a_{43}$ entry also decreases in size dramatically in 20 iterations, but $a_{32}$ and $a_{21}$ are reduced moderately (by about a factor of 10). It is the targeting of the eigenvalue of the last $2 \times 2$ matrix that gives us this dramatic reduction in $a_{54}$.

Fig. 2.5.4 Results for QR algorithm with shifts applied to (2.5.3)

Fig. 2.5.5 Results for the QR algorithm with shifts and deflation for (2.5.3)

Since convergence of the bottom-right entry $(A_j)_{nn}$ to an eigenvalue is fast, once this eigenvalue estimate is sufficiently accurate, how can we accelerate the convergence of the other eigenvalue estimates? The answer is deflation. This is where some very small entries are set to zero, and the connection between this eigenvalue and the rest of the matrix is severed. For example, when the entries $(A_j)_{n,k}$ for $k < n$ of the last row are no more in magnitude than perhaps unit roundoff $u$ times $|(A_j)_{nn}|$, then we can set $(A_j)_{n,k} \leftarrow 0$ for $k = 1, 2, \ldots, n-1$. If we do this, then $(A_j)_{nn}$ is an eigenvalue of $A_j$ and it is decoupled from the rest of the matrix.
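The deflation test described above can be sketched as follows. This is a minimal illustration, assuming a real matrix stored as a list of rows and IEEE double precision; the function name `deflate_last_row` and the optional `tol` parameter are hypothetical.

```python
import sys

def deflate_last_row(A, tol=None):
    """Zero the off-diagonal part of the last row of A (in place) if every
    entry there is at most tol * |A[n-1][n-1]|. Returns True on deflation."""
    n = len(A)
    if tol is None:
        tol = sys.float_info.epsilon / 2   # unit roundoff u for IEEE doubles
    threshold = tol * abs(A[n - 1][n - 1])
    if all(abs(A[n - 1][k]) <= threshold for k in range(n - 1)):
        for k in range(n - 1):
            A[n - 1][k] = 0.0
        return True
    return False
```

After a successful call, `A[n-1][n-1]` can be recorded as an eigenvalue and the iteration can continue on the leading $(n-1) \times (n-1)$ block.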

We can then start processing the top-left $(n-1) \times (n-1)$ submatrix of $A_j$ and start targeting its bottom-right $2 \times 2$ submatrix. As we do this, we should remember that the last column of $A_j$ will be modified with each iteration, even though the orthogonal matrices we apply to it will be determined by the top-left $(n-1) \times (n-1)$ submatrix of $A_j$. Results of this shift-deflate-and-retarget strategy are shown in Figure 2.5.5. Note that the matrix was completely deflated in 21 iterations. This method is sequential in the sense that it can only target one eigenvalue for accelerated convergence at a time. Nevertheless, the matrices $A_j$ converge rapidly once the targeted eigenvalue estimates become moderately accurate.


2.5.3.4 Speeding QR Iterations 


Shifting and deflating make the method converge rapidly in terms of the number of iterations. But we can also make each iteration fast. The cost of a QR factorization is $O(n^3)$ for an $n \times n$ matrix. However, with a single $O(n^3)$ pre-processing step, we can do each subsequent QR factorization in $O(n^2)$ operations. Furthermore, for matrices of special structure, such as real symmetric or complex Hermitian matrices, this pre-processing step means that each subsequent QR factorization can be done in only $O(n)$ operations.

Since the total number of iterations, assuming that we get good convergence for each targeted eigenvalue, is $O(n)$, we get a total of $O(n^3)$ operations for the Schur decomposition of a general $n \times n$ matrix.

Algorithm 38 Constructing Hessenberg matrix H that is unitarily similar to A
    function mkhessenberg($A$)
        if $n \le 2$: return $(A, I)$
        $Q \leftarrow I$
        $a_1 \leftarrow [a_{21}, a_{31}, \ldots, a_{n1}]^T$
        $v'_1 \leftarrow$ householder($a_1$); $v_1 \leftarrow [0, {v'_1}^T]^T$
        $\gamma_1 \leftarrow 2/(\overline{v_1}^T v_1)$; $P_1 \leftarrow I - \gamma_1 v_1 \overline{v_1}^T$
        $A \leftarrow \overline{P_1}^T A P_1$
        $(H', Q') \leftarrow$ mkhessenberg($A_{2:n,2:n}$)
        $A \leftarrow \begin{bmatrix} 1 & \\ & \overline{Q'}^T \end{bmatrix} A \begin{bmatrix} 1 & \\ & Q' \end{bmatrix}$; $Q \leftarrow P_1 \begin{bmatrix} 1 & \\ & Q' \end{bmatrix}$
        return $(A, Q)$
    end function

To get rapid QR factorizations, we need to pre-process the matrix $A$ into Hessenberg form. That is, we want to find a real orthogonal or complex unitary matrix $Q$ so that $H = \overline{Q}^T A Q$ is Hessenberg, that is, where $h_{ij} = 0$ if $i > j + 1$. In other words, in $H$, all non-zero entries occur on or above the first sub-diagonal. If $A$ is real symmetric or complex Hermitian, then $H$ must also have this property, and so $H$ must also be tridiagonal. We can compute $H$ using Householder reflectors $P = I - 2v\overline{v}^T / (\overline{v}^T v)$ as are used to compute QR factorizations (see Section 2.2.2.4). We must apply these orthogonal matrices in a symmetric fashion to $A$: $A \leftarrow \overline{P}^T A P = P A P$ as $\overline{P}^T = P$. We choose $v$ so that if $a = \begin{bmatrix} \alpha \\ a' \end{bmatrix}$ is the first column of $A$, then $v = \begin{bmatrix} 0 \\ v' \end{bmatrix}$ where $(I - 2v'\overline{v'}^T / (\overline{v'}^T v'))a' = \gamma \|a'\|_2 e_1$ with $|\gamma| = 1$.

Applying $\overline{P}^T A$ will zero every entry in the first column of $A$ except for the first two entries. The first entry will be unchanged, but the second will be $\gamma \|a'\|_2$. Applying the Householder reflector symmetrically to the columns, $\overline{P}^T A P$, we note that the first column of $\overline{P}^T A$ is unchanged, preserving the zeros that had just been introduced. Repeating this process recursively on the bottom-right $(n-1) \times (n-1)$ submatrix of $\overline{P}^T A P$ will give us a Hessenberg matrix. This is Algorithm 38.
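The reduction just described can be sketched in Python. This is an illustrative variant under stated assumptions: the matrix is real and stored as a list of rows, the recursion is unrolled into a loop over columns, and the accumulated orthogonal matrix $Q$ is omitted for brevity. The name `hessenberg` is hypothetical.

```python
import math

def hessenberg(A):
    """Return H = Q^T A Q in Hessenberg form for a real square matrix A,
    using Householder reflectors applied from both sides (Q not formed)."""
    n = len(A)
    H = [row[:] for row in A]
    for k in range(n - 2):
        x = [H[i][k] for i in range(k + 1, n)]      # part of column k below the diagonal
        alpha = math.sqrt(sum(t * t for t in x))
        if alpha == 0.0:
            continue                                # nothing to eliminate
        v = x[:]
        v[0] += math.copysign(alpha, x[0])          # v = x + sign(x_0) ||x|| e_1
        vv = sum(t * t for t in v)
        for j in range(n):                          # H <- P H (rows k+1..n-1)
            f = 2.0 * sum(v[i] * H[k + 1 + i][j] for i in range(len(v))) / vv
            for i in range(len(v)):
                H[k + 1 + i][j] -= f * v[i]
        for i in range(n):                          # H <- H P (columns k+1..n-1)
            f = 2.0 * sum(H[i][k + 1 + j] * v[j] for j in range(len(v))) / vv
            for j in range(len(v)):
                H[i][k + 1 + j] -= f * v[j]
    return H
```

For a symmetric input the result should be tridiagonal up to roundoff, in line with the remark above; the entries below the first sub-diagonal are only zero to roundoff and could be set to exactly zero afterwards.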

The real advantage of this pre-processing step is how it expedites the QR factorization. The QR factorization of a Hessenberg matrix can be computed by means of Givens' rotations (see Section 2.2.2.5) as shown in Algorithm 39.

Algorithm 39 QR factorization of a Hessenberg matrix
    function QRhessenberg($H$)
        $Q \leftarrow I$
        for $k = 1, 2, \ldots, n-1$
            $(c, s) \leftarrow$ givens($h_{kk}$, $h_{k+1,k}$)
            $Q_{1:n,\,k:k+1} \leftarrow Q_{1:n,\,k:k+1} \begin{bmatrix} c & +s \\ -s & c \end{bmatrix}$
            $H_{k:k+1,\,1:n} \leftarrow \begin{bmatrix} c & -s \\ +s & c \end{bmatrix} H_{k:k+1,\,1:n}$
        end for
        return $(H, Q)$
    end

The $Q$ matrix in the QR factorization of a Hessenberg matrix is also Hessenberg. This can be seen from the fact that the $Q$ matrix is a product of Givens rotations: $Q = \overline{G_1}^T \overline{G_2}^T \cdots \overline{G_{n-1}}^T$, where $G_k$ differs from the identity only in rows and columns $k$ and $k+1$.

Since $Q$ is Hessenberg and $R$ is upper triangular, $RQ$ is also Hessenberg: $(RQ)_{ij} = \sum_k r_{ik} q_{kj}$ and the sum is taken over all indexes $k$ where $i \le k \le j + 1$; there can only be a non-zero sum when $i \le j + 1$, showing that $RQ$ is Hessenberg.
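The claim that $RQ$ stays Hessenberg, and that one step costs only $O(n^2)$ operations, can be checked with a short sketch. The assumptions here: a real Hessenberg matrix stored as a list of rows, rotations of the form $\begin{bmatrix} c & s \\ -s & c \end{bmatrix}$ (the sign convention of `givens` in the book may differ), and no shift. The name `hessenberg_qr_step` is hypothetical.

```python
import math

def hessenberg_qr_step(H):
    """Given Hessenberg H = Q R, return R Q using n-1 Givens rotations."""
    n = len(H)
    R = [row[:] for row in H]
    rots = []
    for k in range(n - 1):                 # reduce H to upper triangular R
        a, b = R[k][k], R[k + 1][k]
        r = math.hypot(a, b)
        c, s = (1.0, 0.0) if r == 0.0 else (a / r, b / r)
        rots.append((c, s))
        for j in range(k, n):              # rows k, k+1; earlier columns are zero
            x, y = R[k][j], R[k + 1][j]
            R[k][j], R[k + 1][j] = c * x + s * y, -s * x + c * y
        R[k + 1][k] = 0.0                  # eliminated entry
    for k, (c, s) in enumerate(rots):      # form R Q = R G_1^T ... G_{n-1}^T
        for i in range(k + 2):             # only rows 0..k+1 are non-zero here
            x, y = R[i][k], R[i][k + 1]
            R[i][k], R[i][k + 1] = c * x + s * y, -s * x + c * y
    return R
```

Each rotation touches two rows or two columns, so the whole step uses on the order of $n^2$ operations, and entries below the first sub-diagonal are never written.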

We can use the Hessenberg structure to create chasing algorithms that compute the QR factorization and the RQ product simultaneously. If $G_1, G_2, \ldots, G_{n-1}$ are the Givens rotation matrices used in the QR factorization of Algorithm 39, then $R = G_{n-1} \cdots G_2 G_1 H$, so $Q = \overline{G_1}^T \overline{G_2}^T \cdots \overline{G_{n-1}}^T$. From this, $RQ = G_{n-1} \cdots G_2 G_1 H \overline{G_1}^T \overline{G_2}^T \cdots \overline{G_{n-1}}^T$, which is another Hessenberg matrix $K$.

The matrix $G_1 H \overline{G_1}^T$ is almost Hessenberg: $(G_1 H \overline{G_1}^T)_{31}$ may be non-zero. Since $G_2$ differs from the identity matrix only at entries $(i, j)$ with $i, j \in \{2, 3\}$, it is possible to “patch” this violation of being Hessenberg with $G_2$ by zeroing out the $(3, 1)$ entry using the $(2, 1)$ entry, since $G_2$ acts on rows 2 and 3 from the left. The resulting matrix $G_2 G_1 H \overline{G_1}^T \overline{G_2}^T$ is again almost Hessenberg because $(G_2 G_1 H \overline{G_1}^T \overline{G_2}^T)_{42}$ might be non-zero. This time this can be “patched” by using $G_3$ to zero the $(4, 2)$ entry using the $(3, 2)$ entry. Continuing in this way we “chase” the non-Hessenberg entry finally out of the matrix using $G_{n-1}$ to zero out the $(n, n-2)$ entry using the $(n-1, n-2)$ entry. Figure 2.5.6 illustrates the overall process.

This gives us a new Hessenberg matrix, but is it really (essentially) the same as what we would get using the QR factorization? The answer is yes, due to the Implicit Q theorem. We say an $n \times n$ Hessenberg matrix $H$ is unreduced if $h_{k+1,k} \ne 0$ for $k = 1, 2, \ldots, n-1$.

Fig. 2.5.6 Hessenberg chasing algorithm

Theorem 2.29 (Implicit Q theorem) Suppose that $B$ is a real or complex matrix and there are matrices $Q$ and $V$, both orthogonal if $B$ is real and unitary if $B$ is complex, such that both $\overline{Q}^T B Q$ and $\overline{V}^T B V$ are unreduced Hessenberg matrices. If the first columns of $Q$ and $V$ are the same, then for every $j$ there is a number $\gamma_j$ with $|\gamma_j| = 1$ where $Q e_j = \gamma_j V e_j$. Thus there is a diagonal matrix $D$ with diagonal entries of magnitude one where

$$\overline{Q}^T B Q = \overline{D}^T\, \overline{V}^T B V D.$$

For a proof, see, for example, [105, Thm. 7.4.2, pp. 346-347].

In our situation, provided $H$ is unreduced Hessenberg, the result of the chasing algorithm computing the Givens rotations sequentially is essentially equivalent to the original QR factorization approach. In fact, because two QR factorizations $B = Q_1 R_1 = Q_2 R_2$ with $Q_1$ and $Q_2$ real orthogonal or complex unitary are related by $Q_1 = Q_2 D$ and $R_1 = \overline{D} R_2$ where $D$ is diagonal with diagonal entries of magnitude one, the magnitudes of the entries of $H$ do not depend on how that QR factorization is performed.

On the other hand, if $H$ is Hessenberg but not unreduced Hessenberg, then some $h_{k+1,k} = 0$, and the matrix can be deflated.

The remaining question is how to compute $G_1$. If the shift is $\mu$ we want to choose $G_1$ so that

$$\begin{bmatrix} c_1 & s_1 \\ -s_1 & c_1 \end{bmatrix} \begin{bmatrix} h_{11} - \mu \\ h_{21} \end{bmatrix} = \begin{bmatrix} * \\ 0 \end{bmatrix}.$$

The remaining $G_k$'s are computed by the chasing algorithm outlined above. This gives a fast $O(n^2)$ algorithm for performing the shifted QR step.

It should be noted that the chasing algorithm outlined is not numerically stable in the sense that small errors or perturbations in $G_1$ can result in large errors in the following $G_k$'s, especially if there is some $h_{k+1,k} \approx 0$. Fortunately, this only affects the rate of convergence of the method, not the orthogonality of the resulting $Q$ matrix, nor the accuracy of the similarity $\mathrm{fl}(\overline{Q_j}^T A_j Q_j) \approx \overline{Q_j}^T A_j Q_j$ using the chasing algorithm.

A special case is where $A$ is real symmetric or complex Hermitian. Then the Hessenberg matrix computed by Algorithm 38 is tridiagonal. Furthermore, we can perform the QR algorithm step using a streamlined chasing algorithm as illustrated in Figure 2.5.7.

Fig. 2.5.7 Tridiagonal chasing algorithm

There is a variant of the QR algorithm for general real matrices that avoids complex arithmetic. It does not produce a Schur decomposition, as that would necessarily be complex if the matrix has complex eigenvalues. Instead, it computes a real Schur decomposition

$$Q^T A Q = T$$

with $Q$ real orthogonal and $T$ real block upper triangular with $1 \times 1$ or $2 \times 2$ diagonal blocks. Each diagonal $2 \times 2$ block represents a pair of complex conjugate eigenvalues. The most difficult part of implementing the QR algorithm for this case is dealing with complex shifts. By appealing to the implicit Q theorem (Theorem 2.29) we can create a chasing algorithm. If we want to use a complex shift $\mu = \rho + i\sigma$ we actually need to also apply the complex conjugate shift $\overline{\mu} = \rho - i\sigma$ in order to keep the arithmetic real. Applying this to a real Hessenberg matrix $H$ we need to find the first column of $(H - \mu I)(H - \overline{\mu} I) = H^2 - 2\rho H + (\rho^2 + \sigma^2) I$. When we compute $(H - \mu I)(H - \overline{\mu} I)e_1$ we get a vector with (in general) three non-zero leading entries:

$$\begin{bmatrix} h_{11}^2 + h_{12}h_{21} - 2\rho h_{11} + \rho^2 + \sigma^2 \\ h_{21}(h_{11} + h_{22} - 2\rho) \\ h_{21}h_{32} \\ 0 \\ \vdots \\ 0 \end{bmatrix}.$$

We need to use Householder reflectors with $v$ vectors of length three, rather than Givens rotations, to zero out all but the first entry. The resulting chasing algorithm now has three “non-Hessenberg” entries in a triangle to chase down and out of the matrix. This means we use $3 \times 3$ Householder reflectors throughout to do the “chasing”. Remarkably, Francis worked out all these details in 1961. For full details, see [105, Sec. 7.5, pp. 352-361].
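The three leading entries above can be checked numerically. The following sketch, with hypothetical function names, evaluates the real formula and compares it against a direct complex computation of $(H - \mu I)(H - \overline{\mu} I)e_1$ for a real Hessenberg matrix stored as a list of rows.

```python
def double_shift_first_column(H, rho, sigma):
    """First three entries of (H - mu I)(H - conj(mu) I) e_1 in real arithmetic,
    for mu = rho + i sigma and a real Hessenberg H with n >= 3."""
    h11, h12 = H[0][0], H[0][1]
    h21, h22 = H[1][0], H[1][1]
    h32 = H[2][1]
    return [h11 * h11 + h12 * h21 - 2 * rho * h11 + rho * rho + sigma * sigma,
            h21 * (h11 + h22 - 2 * rho),
            h21 * h32]

def first_column_direct(H, rho, sigma):
    """Same vector computed with complex arithmetic, for comparison only."""
    n = len(H)
    mu = complex(rho, sigma)
    # w = (H - conj(mu) I) e_1
    w = [H[i][0] - (mu.conjugate() if i == 0 else 0) for i in range(n)]
    # z = (H - mu I) w
    return [sum((H[i][k] - (mu if i == k else 0)) * w[k] for k in range(n))
            for i in range(n)]

H = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [0.0, 9.0, 1.0, 2.0],
     [0.0, 0.0, 3.0, 4.0]]
x = double_shift_first_column(H, 1.5, 0.5)   # [10.5, 20.0, 45.0]
z = first_column_direct(H, 1.5, 0.5)         # imaginary parts cancel
```

The imaginary parts of `z` vanish, which is the point of pairing the shift with its conjugate: the whole step can be carried out in real arithmetic.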


2.5.4 Singular Value Decomposition 


The singular value decomposition (SVD) of an $m \times n$ matrix $A$ is the factorization

(2.5.9)  $$A = U \Sigma \overline{V}^T,$$

where $U$ and $V$ are real orthogonal if $A$ is real or unitary if $A$ is complex, and $\Sigma$ is $m \times n$ and diagonal with diagonal entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min(m,n)} \ge 0$. The values $\sigma_j$ are called singular values of $A$. The singular value decomposition is a valuable tool in many areas including statistics, where it forms the computational foundation of principal components analysis (PCA). It can be used to define the pseudo-inverse of a general rectangular matrix. The SVD is a key part of many dimension reduction algorithms. It is a common tool in certain data mining tasks, such as in recommender systems.


2.5.4.1 Existence of the SVD 


Theorem 2.30 The singular value decomposition (SVD) exists for any $m \times n$ matrix $A$.

Many discussions of the SVD point out that $\overline{A}^T A$ is an $n \times n$ positive semi-definite Hermitian or real symmetric matrix, and so has non-negative eigenvalues which can be ordered from largest to smallest $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$. An orthonormal basis of eigenvectors can be used to form the $V$ matrix, and we take $\sigma_j = \sqrt{\lambda_j}$. An immediate consequence is that the singular values $\sigma_j$ are indeed functions of $A$ alone, even if there can be different singular value decompositions of $A$. These differences can only occur in $U$ or $V$ when there are repeated eigenvalues.

To complete the existence argument we need to construct the $U$ matrix. We can apply the same arguments to $A\overline{A}^T$ for $U$ as we applied to $\overline{A}^T A$ for $V$. But this is not sufficient. We need to show that the non-zero eigenvalues of $\overline{A}^T A$ and $A\overline{A}^T$ are the same. Also, we need the $U$ and $V$ to be consistent with each other, which might not occur from the previous arguments if there are repeated eigenvalues. Finally, working with $\overline{A}^T A$ or $A\overline{A}^T$ numerically is not a good approach because the error in $\lambda_n$ is $O(u\,\|\overline{A}^T A\|_2) = O(u\,\|A\|_2^2)$; taking square roots amplifies this error if $\lambda_n$ is small. We use a different approach in the proof that better reflects the computational and numerical issues.


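Proof Let

$$C = \begin{bmatrix} 0 & A \\ \overline{A}^T & 0 \end{bmatrix}.$$

This is a complex Hermitian or real symmetric matrix. As such there is a complex unitary or real orthogonal matrix $Z$ where $\overline{Z}^T C Z$ is real diagonal with the diagonal entries in decreasing order. Suppose that $\lambda \ne 0$ is an eigenvalue of $C$ with eigenvector $z \ne 0$. Then we can split $z$ consistently with the block structure of $C$: $z = [u^T, v^T]^T / \sqrt{2}$. Then

$$\begin{bmatrix} 0 & A \\ \overline{A}^T & 0 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = \lambda \begin{bmatrix} u \\ v \end{bmatrix}.$$

Thus eigenvalues come in pairs $(\lambda, -\lambda)$ with eigenvectors $[u^T, v^T]^T$ and $[u^T, -v^T]^T$. By orthogonality of eigenvectors with distinct eigenvalues,

$$0 = \overline{\begin{bmatrix} u \\ v \end{bmatrix}}^T \begin{bmatrix} u \\ -v \end{bmatrix} = \overline{u}^T u - \overline{v}^T v.$$

That is, $\|u\|_2 = \|v\|_2 = 1$. Furthermore, if $z_j$, $z_k$ are distinct columns of $Z$ with eigenvalues $\lambda_j, \lambda_k > 0$ then splitting $z_j$ and $z_k$ in the same way as above gives

$$0 = \overline{\begin{bmatrix} u_j \\ v_j \end{bmatrix}}^T \begin{bmatrix} u_k \\ v_k \end{bmatrix} = \overline{\begin{bmatrix} u_j \\ v_j \end{bmatrix}}^T \begin{bmatrix} u_k \\ -v_k \end{bmatrix}.$$

That is, $0 = \overline{u_j}^T u_k + \overline{v_j}^T v_k = \overline{u_j}^T u_k - \overline{v_j}^T v_k$. Then $0 = \overline{u_j}^T u_k = \overline{v_j}^T v_k$. Thus the sets $\{\, u_j \mid \lambda_j > 0 \,\}$ and $\{\, v_j \mid \lambda_j > 0 \,\}$ are sets of orthonormal vectors. We also need to look at eigenvectors with eigenvalue zero:

$$\begin{bmatrix} 0 & A \\ \overline{A}^T & 0 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = 0 \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix},$$

so $Av = 0$ and $\overline{A}^T u = 0$. The dimension of the set of all such vectors $[u^T, v^T]^T$ is the sum of the dimensions of the null spaces of $A$ and $\overline{A}^T$. So we can add a basis for the null space of $A$ for the $v$'s and the null space of $\overline{A}^T$ for the $u$'s. Any $u$ in the null space of $\overline{A}^T$ is orthogonal to $u_k$ with $\lambda_k > 0$. Let $r$ be the number of positive eigenvalues of $C$. Then we set

$$U = [u_1, u_2, \ldots, u_r, u_{r+1}, \ldots, u_m],$$
$$V = [v_1, v_2, \ldots, v_r, v_{r+1}, \ldots, v_n],$$

where $u_{r+1}, \ldots, u_m$ form an orthonormal basis for $\mathrm{null}(\overline{A}^T)$ and $v_{r+1}, \ldots, v_n$ form an orthonormal basis for $\mathrm{null}(A)$. Then $Av_j = \lambda_j u_j$ and $\overline{A}^T u_j = \lambda_j v_j$ for all appropriate $j$. Let $\sigma_j = \lambda_j$ where $\lambda_j > 0$. Then

$$\overline{U}^T A v_j = \lambda_j e_j = \Sigma e_j \quad \text{for } j = 1, 2, \ldots, n,$$
$$\overline{V}^T \overline{A}^T u_j = \lambda_j e_j = \overline{\Sigma}^T e_j \quad \text{for } j = 1, 2, \ldots, m.$$

That is, $\overline{U}^T A V = \Sigma$ and so $A = U \Sigma \overline{V}^T$, as we wanted.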


2.5.4.2 Uses of the SVD 


The pseudo-inverse of a matrix $A$ (2.2.4) can be generalized by defining the pseudo-inverse as

(2.5.10)  $$A^+ = V \Sigma^+ \overline{U}^T,$$

where $\Sigma^+$ is the $n \times m$ diagonal matrix with $(\Sigma^+)_{jj} = 1/\sigma_j$ if $\sigma_j > 0$ and $(\Sigma^+)_{jj} = 0$ if $\sigma_j = 0$. With this definition, the solution of $\min_x \|Ax - b\|_2$ with the minimum value of $\|x\|_2$ is $x = A^+ b$.

The rank of $A$ is $\max\{\, j \mid \sigma_j > 0 \,\}$. The 2-norm of $A$ is $\|A\|_2 = \sigma_1$. On the other hand, the Frobenius norm is $\|A\|_F = \big[\sum_j \sigma_j^2\big]^{1/2}$. The singular values are invariant under multiplication by real orthogonal and complex unitary matrices: $\sigma_j(QA) = \sigma_j(A\widetilde{Q}) = \sigma_j(A)$ for real orthogonal or complex unitary matrices $Q$ and $\widetilde{Q}$.

The closest matrix to $A$ in the $\|\cdot\|_2$ norm that has rank $r < \mathrm{rank}(A)$ is

(2.5.11)  $$A_r = [u_1, u_2, \ldots, u_r] \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \end{bmatrix} \overline{[v_1, v_2, \ldots, v_r]}^T.$$

This means that we can look for low-rank approximations to a given matrix using the SVD.
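2.5.4.3 Computing the SVD


As we have seen in the proof of the existence of the SVD (Theorem 2.30) and from $\overline{A}^T A = V(\overline{\Sigma}^T \Sigma)\overline{V}^T$, the SVD is closely related to the real symmetric or complex Hermitian eigenvalue problem. For convenience of analysis, we expand either the number of rows or number of columns with zero entries so that the matrix $A$ is square.

The basic QR algorithm can be applied to $C = \begin{bmatrix} 0 & A \\ \overline{A}^T & 0 \end{bmatrix}$. Let $J = \begin{bmatrix} 0 & I \\ I & 0 \end{bmatrix}$, which swaps the block rows or columns and is a real orthogonal matrix. Assume that $m \ge n$. Then $JC = \begin{bmatrix} \overline{A}^T & 0 \\ 0 & A \end{bmatrix}$. The QR factorization of $JC$ is given by

$$\begin{bmatrix} \overline{A}^T & 0 \\ 0 & A \end{bmatrix} = \begin{bmatrix} P & 0 \\ 0 & Q \end{bmatrix} \begin{bmatrix} R & 0 \\ 0 & S \end{bmatrix}$$

with $P$ and $Q$ real orthogonal or complex unitary and $R$, $S$ upper triangular. Then the result of one step of the QR algorithm is

$$\begin{bmatrix} R & 0 \\ 0 & S \end{bmatrix} \begin{bmatrix} 0 & I \\ I & 0 \end{bmatrix} \begin{bmatrix} P & 0 \\ 0 & Q \end{bmatrix} = \begin{bmatrix} 0 & RQ \\ SP & 0 \end{bmatrix} = \begin{bmatrix} 0 & \overline{P}^T \overline{A}^T Q \\ \overline{Q}^T A P & 0 \end{bmatrix}.$$

Note that $A = QS$ is the QR factorization of $A$ and $\overline{A}^T = PR$ is the QR factorization of $\overline{A}^T$. To accelerate each iteration of the QR algorithm, we transform the matrix $A$ into bidiagonal form.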

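For computing eigenvalues in the symmetric or Hermitian cases, we first reduce the matrix to a tridiagonal matrix. For the SVD, we can reduce the matrix to a bidiagonal form: we find real orthogonal or complex unitary $U_0$ and $V_0$ where

(2.5.12)  $$\overline{U_0}^T A V_0 = B = \begin{bmatrix} \alpha_1 & \beta_1 & & & \\ & \alpha_2 & \beta_2 & & \\ & & \alpha_3 & \ddots & \\ & & & \ddots & \beta_{n-1} \\ & & & & \alpha_n \\ 0 & 0 & 0 & \cdots & 0 \end{bmatrix} \quad \text{for } m \ge n.$$

We can do this by setting $A' \leftarrow P_1 A$ where $P_1$ is a Householder reflector where

$$P_1 \begin{bmatrix} a_{11} \\ a_{21} \\ \vdots \\ a_{m1} \end{bmatrix} = \begin{bmatrix} * \\ 0 \\ \vdots \\ 0 \end{bmatrix}.$$

For the next step, we find a new Householder reflector $P_2$ where

$$P_2 \begin{bmatrix} \overline{a'_{12}} \\ \overline{a'_{13}} \\ \vdots \\ \overline{a'_{1n}} \end{bmatrix} = \begin{bmatrix} * \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

and set $A'' \leftarrow A' \begin{bmatrix} 1 & \\ & P_2 \end{bmatrix}$. This did not change the first column of $A'$ and so keeps all entries of the first column except $a'_{11}$ equal to zero. Then

$$A'' = \begin{bmatrix} \alpha_1 & \beta_1 e_1^T \\ 0 & \widetilde{A} \end{bmatrix}.$$

Recursively applying this approach to $\widetilde{A}$ gives $\overline{U_1}^T \widetilde{A} V_1 = \widetilde{B}$, which is bidiagonal. Then

$$\begin{bmatrix} 1 & \\ & \overline{U_1}^T \end{bmatrix} A'' \begin{bmatrix} 1 & \\ & V_1 \end{bmatrix} = \begin{bmatrix} \alpha_1 & \beta_1 e_1^T \\ 0 & \widetilde{B} \end{bmatrix}.$$

Setting $U_0 = P_1 \begin{bmatrix} 1 & \\ & U_1 \end{bmatrix}$ and $V_0 = \begin{bmatrix} 1 & \\ & P_2 V_1 \end{bmatrix}$, we have $B = \overline{U_0}^T A V_0$, which is bidiagonal.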
Once we have bidiagonal $B = \overline{U_0}^T A V_0$ we can develop chasing algorithms with shifts. We should simultaneously compute QR factorizations $B = QS$ and $\overline{B}^T = PR$. Since $B$ is an upper bidiagonal matrix it is already upper triangular and so we can set $Q = I$. But we still need to compute the QR factorization of $\overline{B}^T$. This can be done using a chasing algorithm that preserves the bidiagonal structure of $B$. At each step we apply a Givens' rotation either from the right or from the left. Let $G_j$ be the Givens' rotation applied from the left to rows $j$ and $j + 1$, while $K_j$ is the Givens' rotation applied from the right to columns $j$ and $j + 1$. The chasing algorithm is illustrated by Figure 2.5.8.

From the implicit Q theorem (Theorem 2.29), once we have decided on $K_1$, the rest of the Givens rotations will specify the QR factorization in an essentially unique way. The problem is, how to decide on what $K_1$ should be? To answer that, we need to understand how to do the shifting.

Fig. 2.5.8 Chasing algorithm for bidiagonal matrix

The singular values are the square roots of the eigenvalues of $\overline{B}^T B$. The shift $\mu$ should be the smallest eigenvalue of the bottom $2 \times 2$ submatrix of $\overline{B}^T B$. Using form (2.5.12), this submatrix is

$$\begin{bmatrix} \alpha_{n-1}^2 + \beta_{n-2}^2 & \alpha_{n-1}\beta_{n-1} \\ \alpha_{n-1}\beta_{n-1} & \alpha_n^2 + \beta_{n-1}^2 \end{bmatrix}.$$

The shift $\mu$ is subtracted from the diagonal of the top-left $2 \times 2$ submatrix:

$$\begin{bmatrix} \alpha_1^2 - \mu & \alpha_1\beta_1 \\ \alpha_1\beta_1 & \alpha_2^2 + \beta_1^2 - \mu \end{bmatrix}.$$

We choose $K_1$ so that

$$\begin{bmatrix} \alpha_1^2 - \mu, & \alpha_1\beta_1 \end{bmatrix} \begin{bmatrix} c_1 & s_1 \\ -s_1 & c_1 \end{bmatrix} = \begin{bmatrix} *, & 0 \end{bmatrix}.$$

Again, an accurate value of the shift mainly helps to increase the speed of convergence, but errors in $\mu$ do not directly cause substantial numerical errors in the solution. Combining all these elements together gives the Golub–Kahan algorithm for computing the SVD [104].
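The shift and the first rotation $K_1$ can be computed directly from the bidiagonal entries. The sketch below assumes a real upper bidiagonal matrix given by its diagonal `alpha` and superdiagonal `beta`; the function name `golub_kahan_first_rotation` is hypothetical, and only the first rotation is formed, not the full chasing sweep.

```python
import math

def golub_kahan_first_rotation(alpha, beta):
    """Return (mu, c, s): the shift mu (smallest eigenvalue of the bottom
    2x2 block of B^T B) and the rotation with
    [alpha_1^2 - mu, alpha_1 beta_1] [[c, s], [-s, c]] = [*, 0]."""
    n = len(alpha)
    # Bottom 2x2 block of B^T B (beta_{n-2} is absent when n == 2).
    p = alpha[n - 2] ** 2 + (beta[n - 3] ** 2 if n >= 3 else 0.0)
    q = alpha[n - 1] ** 2 + beta[n - 2] ** 2
    b = alpha[n - 2] * beta[n - 2]
    mu = (p + q) / 2.0 - math.hypot((p - q) / 2.0, b)
    # First row of B^T B - mu I, restricted to its two non-zero entries.
    x, y = alpha[0] ** 2 - mu, alpha[0] * beta[0]
    r = math.hypot(x, y)
    c, s = (1.0, 0.0) if r == 0.0 else (x / r, -y / r)
    return mu, c, s

mu, c, s = golub_kahan_first_rotation([4.0, 3.0, 2.0], [1.0, 1.0])
```

With these values, `x * s + y * c` vanishes up to roundoff, which is the condition defining $K_1$ above.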
2.5.5 The Lanczos and Arnoldi Methods


Computing the eigenvalues of large matrices is a challenging problem. Usually only a few eigenvalues are desired. We have seen the power method used for computing the dominant eigenvalue for the Google PageRank algorithm (Section 2.5.1). Other problems may require the determination of multiple eigenvalues and their eigenvectors. For a real matrix $A$, a common approach is to find an orthonormal basis for a subspace forming the columns of $V_m$, and then to compute the eigenvalues of $V_m^T A V_m$. The subspace used is typically a Krylov subspace, and so we will use the Arnoldi and Lanczos iterations (Algorithms 29, 30). If we want an approximate SVD of a large matrix, then Lanczos bidiagonalization (2.4.17) may be more appropriate.


2.5.5.1 The Lanczos Method 


The Lanczos iteration (Algorithm 30) for a symmetric $n \times n$ matrix $A$ gives a symmetric tridiagonal matrix $T_m$ and an $n \times m$ matrix of orthonormal columns $V_m$ where

$$A V_m = V_m T_m + \beta_m v_{m+1} e_m^T.$$

Normally we compute $y$ where $T_m y = \lambda y$ with $y \ne 0$ and then our approximate eigenvector is $v = V_m y$. However, there is an error in $v$:

$$Av = A V_m y = V_m T_m y + \beta_m v_{m+1} e_m^T y = \lambda V_m y + \beta_m v_{m+1} y_m = \lambda v + \beta_m v_{m+1} y_m.$$

Then $(v, \lambda)$ is the exact eigenpair of $A + E$ for a matrix $E$ with $\|E\|_2 = |\beta_m y_m|$. Since $A$ is symmetric, the eigenvalue $\lambda$ is in error by no more than $|\beta_m y_m|$. Thus we have an accurate eigenvector if $\beta_m \approx 0$ or $y_m \approx 0$. This is true even if the columns of $V_m$ are far from orthonormal.

Since the Lanczos iteration does not re-orthogonalize the columns of $V_m$, the columns will diverge from being orthonormal. With each iteration of Algorithm 30, the multiplication by $A$ and division by $\beta_j$ will tend to amplify any perturbations, so that roundoff error will slowly be amplified until distant columns of $V_m$ are far from being orthogonal to each other.

The other fact about the Lanczos method for eigenvalues is that it is much better at finding extreme eigenvalues than finding interior eigenvalues. The convergence theory is based on optimization and the Rayleigh quotient

(2.5.13)  $$R(z) = \frac{z^T A z}{z^T z}.$$

The maximum of $R(z)$ is the largest eigenvalue of $A$ provided $A$ is symmetric. For $A$ symmetric,

$$\nabla R(z) = 2\,\frac{Az\,(z^T z) - (z^T A z)\,z}{(z^T z)^2}.$$

$\nabla R(z) = 0$ if and only if $z$ is an eigenvector of $A$. In this case, the eigenvalue for $z$ is $R(z)$. A maximum exists because we can restrict attention to all $z$ where $\|z\|_2 = 1$; this is a compact set so any continuous function on this set will have a maximum and a minimum. So the maximum value of $R(z)$ is the maximum eigenvalue of $A$.
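As a small numerical check of the gradient formula, the sketch below evaluates $R(z)$ and $\nabla R(z)$ for a real symmetric matrix stored as a list of rows. The helper names are hypothetical.

```python
def matvec(A, z):
    """Matrix-vector product for A stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, z)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def rayleigh(A, z):
    """Rayleigh quotient R(z) = z^T A z / z^T z."""
    return dot(z, matvec(A, z)) / dot(z, z)

def rayleigh_gradient(A, z):
    """Gradient of R at z for symmetric A: 2 (Az (z^T z) - (z^T A z) z) / (z^T z)^2."""
    Az = matvec(A, z)
    zz = dot(z, z)
    zAz = dot(z, Az)
    return [2.0 * (Az[i] * zz - zAz * z[i]) / zz ** 2 for i in range(len(z))]

A = [[2.0, 1.0], [1.0, 2.0]]
r_eig = rayleigh(A, [1.0, 1.0])            # 3.0, the largest eigenvalue
g_eig = rayleigh_gradient(A, [1.0, 1.0])   # [0.0, 0.0] at an eigenvector
g_other = rayleigh_gradient(A, [1.0, 0.0]) # [0.0, 2.0], not an eigenvector
```

The gradient vanishes at the eigenvector $[1, 1]^T$ and not at $[1, 0]^T$, matching the statement above.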

Suppose the starting vector of the Lanczos iteration is $z_1$ with $\|z_1\|_2 = 1$, and that $A$ has eigenvalues $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_n$. Let $v_1, v_2, \ldots, v_n$ be an orthonormal basis of eigenvectors of $A$ where $v_j$ has eigenvalue $\lambda_j$. Let $\phi_1$ be the angle between $v_1$ and $z_1$: $\cos\phi_1 = |z_1^T v_1|$. The main theorem regarding convergence is due to Kaniel and Paige [136, 196].


Theorem 2.31 (Lanczos convergence) Under the notation above, if $\theta_1$ is the maximum eigenvalue of $T_m$, then

$$\lambda_1 \ge \theta_1 \ge \lambda_1 - \frac{(\lambda_1 - \lambda_n)\tan^2\phi_1}{T_{m-1}(1 + 2\rho)^2},$$

where $T_{m-1}$ is the Chebyshev polynomial of degree $m - 1$ (4.6.3) and $\rho = (\lambda_1 - \lambda_2)/(\lambda_2 - \lambda_n)$.


For a proof, see [105, Sec. 9.1.4, pp. 475-479]. The basis of the proof is the fact that $\theta_1$ is the maximum eigenvalue of $T_m = V_m^T A V_m$ where $\mathrm{range}(V_m) = K_m(A, z_1)$, the Krylov subspace generated by $z_1$ of dimension $m$. Thus $\theta_1$ is the maximum of $R(z)$ over all $z = p(A)z_1$ where $p$ ranges over all polynomials of degree $\le m - 1$. If $z_1 = \sum_{j=1}^n c_j v_j$, then $p(A)z_1 = \sum_{j=1}^n c_j\, p(\lambda_j)\, v_j$. So

$$R(p(A)z_1) = \frac{\sum_{j=1}^n c_j^2\, \lambda_j\, p(\lambda_j)^2}{\sum_{j=1}^n c_j^2\, p(\lambda_j)^2}.$$

We can get a lower bound on this maximum by setting $p(s) = T_{m-1}(-1 + 2(s - \lambda_n)/(\lambda_2 - \lambda_n))$ to emphasize $p(\lambda_1)$ compared with $p(\lambda)$ for any $\lambda \in [\lambda_n, \lambda_2]$. This gives the bounds in Theorem 2.31.

Details on how the Lanczos method behaves in practice and how to obtain accurate 
estimates of eigenvalues and eigenvectors can be found in Cullum and Willoughby 
[64, 65]. 

Because $T_{m-1}(1 + 2\rho) = \cosh((m-1)\cosh^{-1}(1 + 2\rho)) \approx \tfrac{1}{2}\exp(2(m-1)\sqrt{\rho})$ for $\rho$ small but $(m-1)\sqrt{\rho}$ modest to large, convergence is fairly rapid. Convergence is fast enough that once a near-invariant subspace is achieved, we have $\beta_m \approx 0$. Small errors are amplified by division with $\beta_m$, and the method can be thought of as restarting with a random start vector. The method does not remember past vectors, so the method again converges to the largest (and the smallest) eigenvalue of $A$. Consequently, the method gives repeated extreme eigenvalues, and even repeated near-extreme eigenvalues. The good news is that when the Lanczos method gives an apparent repeated eigenvalue, this eigenvalue is almost always accurate. One approach to improving the accuracy of the Lanczos method is to keep previous accurately computed eigenvectors and orthogonalize every new vector in the method against these accurate eigenvectors. This avoids the spurious multiplicity issue. The method is called selective orthogonalization.
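The growth estimate for $T_{m-1}(1 + 2\rho)$ used at the start of the paragraph above can be checked numerically. The sketch below compares the three-term recurrence, the $\cosh$ formula, and the exponential approximation; the function names and the sample values $\rho = 0.01$, $m = 41$ are illustrative choices only.

```python
import math

def cheb_recurrence(k, x):
    """Chebyshev polynomial T_k(x) via T_{j+1} = 2 x T_j - T_{j-1}."""
    t_prev, t_curr = 1.0, x
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

def cheb_cosh(k, x):
    """T_k(x) = cosh(k arccosh(x)), valid for x >= 1."""
    return math.cosh(k * math.acosh(x))

def cheb_exp_estimate(k, rho):
    """Approximation (1/2) exp(2 k sqrt(rho)) to T_k(1 + 2 rho) for small rho."""
    return 0.5 * math.exp(2.0 * k * math.sqrt(rho))

rho, m = 0.01, 41
exact = cheb_cosh(m - 1, 1.0 + 2.0 * rho)
by_recurrence = cheb_recurrence(m - 1, 1.0 + 2.0 * rho)
estimate = cheb_exp_estimate(m - 1, rho)
```

For these values the Chebyshev factor is of order $10^3$, so its square in the denominator of Theorem 2.31 is of order $10^6$ even though the relative gap $\rho$ is only 1%.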

While in exact arithmetic the Arnoldi iteration for a real symmetric matrix $A$ is equivalent to the Lanczos iteration, it has an important practical difference: every new Arnoldi vector is orthogonalized against all the previous Arnoldi vectors. This is called complete orthogonalization. This prevents the bad numerical behavior of the Lanczos method, but at the cost of $O(m^2 n)$ operations for $m$ Arnoldi steps with $A$ an $n \times n$ matrix, compared with $O(mn)$ operations for $m$ Lanczos steps.

The Arnoldi iteration produces a matrix of orthonormal columns $V_m$ and a Hessenberg matrix $H_m$ where

$$A V_m = V_m H_m + h_{m+1,m}\, v_{m+1} e_m^T.$$

To compute approximate eigenvalues and eigenvectors of $A$, we find an eigenvector of $H_m$: $H_m y = \lambda y$ with $\|y\|_2 = 1$ and use $v = V_m y$ as our approximate eigenvector. Then

$$Av = \lambda v + h_{m+1,m}\, y_m\, v_{m+1}$$

so that $v$ is the exact eigenvector of $A + E$ with eigenvalue $\lambda$ where $\|E\|_2 = |h_{m+1,m}\, y_m|$.
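The Arnoldi relation above can be illustrated with a short sketch. It assumes real arithmetic, vectors stored as Python lists, and modified Gram–Schmidt for the complete orthogonalization; Algorithm 29 in the book may organize the computation differently. The name `arnoldi` and the `matvec` callback are hypothetical.

```python
import math

def arnoldi(matvec, v1, m):
    """Run up to m Arnoldi steps. Returns (V, H): V is a list of orthonormal
    vectors v_1, ..., v_{m+1}; H is the (m+1) x m Hessenberg matrix with
    A v_j = sum_{i <= j+1} H[i][j] v_i."""
    norm = math.sqrt(sum(x * x for x in v1))
    V = [[x / norm for x in v1]]
    H = [[0.0] * m for _ in range(m + 1)]
    for j in range(m):
        w = matvec(V[j])
        for i in range(j + 1):             # orthogonalize against all previous vectors
            H[i][j] = sum(a * b for a, b in zip(V[i], w))
            w = [wk - H[i][j] * vk for wk, vk in zip(w, V[i])]
        H[j + 1][j] = math.sqrt(sum(x * x for x in w))
        if H[j + 1][j] == 0.0:             # exact invariant subspace found
            break
        V.append([x / H[j + 1][j] for x in w])
    return V, H
```

The inner loop over `i` is what costs $O(m^2 n)$ overall. For a symmetric matrix the computed `H` is tridiagonal up to roundoff, which is the Lanczos case.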

The cost of complete orthogonalization can be reduced by restarting the method. The variant of Lehoucq and Sorensen [160] provides an implicitly restarted method. This method captures the information in $[V_m, v_{m+1}]$, $H_m$, and $h_{m+1,m}$ to give a subset of vectors so that the method can be effectively restarted. They have a package called ARPACK that implements this method, which is an excellent method for computing eigenvalues and eigenvectors of large matrices.


Exercises. 


(1) The matrix

$$A = \begin{bmatrix} 5 & 3 & 0 \\ -2 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$$

has eigenvalues 1, 2, and 3. Apply the power method with starting vector $x_0 = [1, 1, 1]^T$. Plot $\|Ax_k - \lambda_k x_k\|_2 / \|x_k\|_2$ against the iteration count $k$.
(2) Show that the $n \times n$ matrix

$$A = \begin{bmatrix} 1 & 1 & & \\ & 1 & \ddots & \\ & & \ddots & 1 \\ & & & 1 \end{bmatrix}$$

has one eigenvalue $\lambda = 1$ repeated $n$ times. Use a general eigenvalue solver to compute the eigenvalues of this matrix for $n = 5, 10, 15$. Symbolically compute the eigenvalues of $A + \epsilon e_n e_1^T$. Numerically compute these eigenvalues for $\epsilon = 10^{-n}$ and $n = 5, 10, 15$.

(3) Show the Wielandt–Hoffman bound [105, Thm. 8.1.4]: If $A$, $E$ are real symmetric $n \times n$ matrices, and $\lambda(B) \in \mathbb{R}^n$ is the vector of eigenvalues of $B$ in decreasing order (counting multiplicities) then

(2.5.14)  $$\|\lambda(A + E) - \lambda(A)\|_2 \le \|E\|_F.$$

[Hint: Let $A + tE = Q(t)^T D(t) Q(t)$ where $Q(t)$ is orthogonal and $D(t)$ diagonal. Supposing both $D(t)$ and $Q(t)$ are differentiable in $t$, show that $dD/dt(t)$ is the diagonal part of $Q(t) E Q(t)^T$. From this, show

$$D(1) - D(0) = \int_0^1 \mathrm{diag}(Q(t) E Q(t)^T)\, dt;$$

taking Frobenius norms, show (2.5.14).]


(4) Show that the eigenvalues of the $n \times n$ matrix

$$A_n = \begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{bmatrix}$$

are $\lambda_j = 2(1 - \cos(\pi j/(n + 1)))$ for $j = 1, 2, \ldots, n$. [Hint: If $Ax = \lambda x$ then for $1 \le k \le n$, $2x_k - x_{k-1} - x_{k+1} = \lambda x_k$, which gives a linear recurrence relation for the $x_k$'s. Put $x_k = a_1 r_1^k + a_2 r_2^k$ where $r_1$ and $r_2$ are solutions of the characteristic equation $2r - 1 - r^2 = \lambda r$. The boundary conditions can be represented by setting $x_0 = 0$ and $x_{n+1} = 0$.] Use a suitable numerical method to compute the eigenvalues of $A_n$ for $n = 5, 10, 15, 20$, and compare with the exact values you have computed.
(5) Check the Wielandt–Hoffman bound numerically: start with $A_n$ from Exercise 4 for $n = 10$. Create a perturbation matrix $E_n$ that is symmetric, either by hand, or by first setting each entry to be $s$ times a random sample of a standard normal distribution, and then setting $E_n \leftarrow (E_n + E_n^T)/2$ to ensure symmetry. First use $s = 0.1$. Compute the vectors of eigenvalues $\lambda(A_n)$ and $\lambda(A_n + E_n)$, each with the eigenvalues sorted in increasing order. (What is important is that they are sorted consistently.) Then check if $\|\lambda(A_n + E_n) - \lambda(A_n)\|_2 \le \|E_n\|_F$. Repeat for $s = 10^{-7}$ and $s = 1$.

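(6) The Schur decomposition $A = \overline{Q}^T T Q$ with $T$ upper triangular and $Q$ unitary is essentially unique if the diagonal entries of $T$ (the eigenvalues of $A$) are distinct and in identical order. Show this by noting that if $A = \overline{Q_1}^T T_1 Q_1 = \overline{Q_2}^T T_2 Q_2$ we have $T_1 Q_1 \overline{Q_2}^T = Q_1 \overline{Q_2}^T T_2$. Set $U = Q_1 \overline{Q_2}^T$, which is unitary. Show that if $TU = US$ where $T$ and $S$ are upper triangular with $\mathrm{diag}(T) = \mathrm{diag}(S)$ and distinct diagonal entries and $U$ unitary, then $U$ is a diagonal matrix with diagonal entries that are complex numbers of size one. Do this by induction on $n$ where $T$ and $S$ are $n \times n$. For the induction step let

$$T = \begin{bmatrix} \tau & t^T \\ 0 & \widetilde{T} \end{bmatrix}, \quad S = \begin{bmatrix} \sigma & s^T \\ 0 & \widetilde{S} \end{bmatrix}, \quad U = \begin{bmatrix} \upsilon & u^T \\ z & \widetilde{U} \end{bmatrix}.$$

Note that $\tau = \sigma$ because $\mathrm{diag}(T) = \mathrm{diag}(S)$. Show that equating $TU = US$ implies $u = 0$, $z = 0$, and $\widetilde{T}\widetilde{U} = \widetilde{U}\widetilde{S}$.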

(7) The Schur decomposition can also be used to solve Sylvester equations: find an $m \times n$ matrix $X$ where $AX + XB = C$ for given matrices $A$ ($m \times m$), $B$ ($n \times n$), and $C$ ($m \times n$). Since these are linear equations, they can be solved using an $mn \times mn$ matrix, which would take $O(m^3 n^3)$ floating point operations and use $O(m^2 n^2)$ memory. Instead we can use the Schur decompositions $A = \overline{Q}^T R Q$ and $B = \overline{U}^T S U$ where $Q$ and $U$ are unitary, and $R$, $S$ are upper triangular. Create a recursive algorithm based on block splittings of $R$, $S$, and $Y$, where $Y = Q X \overline{U}^T$. The algorithm should take $O(n^3 + mn(m + n))$ operations once the Schur decompositions have been computed. Show that the method you create succeeds provided $A$ and $-B$ have no eigenvalues in common.

(8) The Lanczos iteration in exact arithmetic generates a tridiagonal matrix $T_m$ and a matrix $V_m$ of orthonormal columns where

$$A V_m = V_m T_m + \beta_m v_{m+1} e_m^T.$$

Implement the Lanczos method or use an implementation of it applied to the matrix $A_h$ from Exercise 1. Use $h = 1/10$ as a concrete value. For applying the Lanczos algorithm use $m = 40$ and a random start vector. You may notice that the smallest and largest eigenvalues of $T_m$ are repeated. Compute $V_m^T V_m$. How much does it deviate from $I$? Even if $V_m$ is not close to having orthonormal columns, as long as the columns are linearly independent, show that if $T_m z = \lambda z$ and $z_m = 0$ then $V_m z$ is an eigenvector of $A$. Use this to argue that repeated eigenvalues of $T_m$ are eigenvalues of $A$.

The Perron-Frobenius theorem says that if $A$ is a square matrix of non-negative
entries then there is an eigenvalue $\lambda_1 \ge 0$ (positive unless $A = 0$) which has an
eigenvector $v_1$ ($A v_1 = \lambda_1 v_1$, $v_1 \ne 0$) with non-negative entries. Furthermore,
every eigenvalue $\lambda$ of $A$ satisfies $|\lambda| \le \lambda_1$. Prove that the existence of $\lambda_1$ and $v_1$
implies $|\lambda| \le \lambda_1$ for every eigenvalue of $A$.


Chapter 3
Solving nonlinear equations


Unlike solving linear equations, solving nonlinear equations cannot be done by a 
“direct” method except in special circumstances, such as solving a quadratic equation 
in one unknown. We therefore focus on iterative methods. From the numerical point 
of view, there are a number of issues that arise: 


• How rapidly do the methods converge?
• Can the methods fail, and if so, do they fail gracefully?
• What is the trade-off between computational cost and accuracy?


We normally discuss the asymptotic speed of convergence as this is the easiest to 
understand theoretically and apply to practice. However, if our initial “guess” is 
far from the true solution, it may take many iterations to begin to approach a true 
solution, even if the method is asymptotically fast. 

Methods can fail by falling into infinite loops far from a solution, failing to
converge, or resulting in impossible computations such as division by zero or taking
the square root of a negative number in real arithmetic. After all, not all equations 
have solutions. How should a numerical method handle such problems? We would 
like our methods to be robust (able to fail gracefully for a difficult or impossible 
problem) and reliable (able to provide results even for difficult problems), as well as 
being accurate and efficient. 

Many of the methods we present here are applicable only to single equations 
in a single, scalar, unknown. Others can be applied to problems of arbitrarily high 
dimension, but with a corresponding loss of guarantees. Which method should be 
used depends, at least in part, on the circumstances. 


3.1 Bisection method 


The bisection method is the best method we know of for reliability and robustness. 
We start with a theorem, well known from calculus and basic analysis: 
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

Algorithm 40 Bisection method

 1 function bisect(f, a, b, ε)
 2   if sign f(a) ≠ −sign f(b)
 3     return fail
 4   end if
 5   while |b − a| > ε
 6     m ← (a + b)/2
 7     if f(m) = 0: return m; end if
 8     if sign f(m) = sign f(a)
 9       a ← m
10     else // sign f(m) = sign f(b)
11       b ← m
12     end if
13   end while
14   return (a + b)/2
15 end function


Theorem 3.1 (Intermediate Value Theorem) Suppose that $f\colon [a, b] \to \mathbb{R}$ is continuous
and $y$ is a real number between $f(a)$ and $f(b)$. Then there is a $c \in [a, b]$ where
$f(c) = y$.


The special case with $y = 0$ gives the fact that if $f(a)$ and $f(b)$ have opposite
signs, then there must be a $c$ between $a$ and $b$ where $f(c) = 0$. This theorem does
not give a way of computing the value of a solution $c$. The bisection algorithm
remedies this problem. In fact, the bisection algorithm can give a constructive proof
of the intermediate value theorem. Pseudo-code for the bisection algorithm is given in
Algorithm 40.


Example 3.2 As an example, take $f(x) = x e^x - 3$. Then $f(1) = 1 \times e^1 - 3 =
e - 3 < 0$ since $e = 2.718\ldots$. On the other hand, $f(2) = 2 \times e^2 - 3 > 2 \times 2^2 -
3 = 5 > 0$. Thus we can take $a = 1$ and $b = 2$. Then applying the algorithm gives
the results in Table 3.1.1.
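The computation in Example 3.2 can be reproduced with a short program. The following is a minimal Python sketch of Algorithm 40, not a definitive implementation; the helper `sign`, the returned count of passes through the loop, and the particular tolerances are our own choices for illustration.

```python
import math

def sign(y):
    """Return -1, 0 or +1 according to the sign of y."""
    return (y > 0) - (y < 0)

def bisect(f, a, b, eps):
    """Bisection as in Algorithm 40; returns (estimate, passes through the loop)."""
    if sign(f(a)) != -sign(f(b)):
        raise ValueError("f(a) and f(b) must have opposite signs")
    passes = 0
    while abs(b - a) > eps:
        m = (a + b) / 2
        if f(m) == 0:
            return m, passes
        if sign(f(m)) == sign(f(a)):
            a = m
        else:  # sign f(m) = sign f(b)
            b = m
        passes += 1
    return (a + b) / 2, passes

def f(x):
    return x * math.exp(x) - 3

# Ten halvings of [1, 2]: the midpoint of the last interval in Table 3.1.1.
x10, passes10 = bisect(f, 1.0, 2.0, 2.0 ** -10)
print(passes10, x10)

# A tighter tolerance; the interval length is 2**(-k) after k passes.
root, passes = bisect(f, 1.0, 2.0, 1e-10)
print(passes, root, f(root))
```

Since the interval length halves on every pass, the number of passes for the second call is the smallest $k$ with $2^{-k} \le 10^{-10}$, which can be compared with the printed count.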


3.1.1 Convergence 


To study the convergence of this method we need to introduce some notation: let $a_k$
and $b_k$ be the values of $a$ and $b$ at the start of the while loop (line 5) after completing
$k$ passes through the body of the while loop. The initial values of $a$ and $b$ are then
$a_0$ and $b_0$.

The most important facts for establishing convergence are that $|b_{k+1} - a_{k+1}| =
\frac{1}{2}|b_k - a_k|$ and, assuming $a_0 < b_0$, we have $a_k \le a_{k+1} < b_{k+1} \le b_k$ for $k = 0,
1, 2, \ldots$. Note that if $b_0 < a_0$ then corresponding inequalities still hold, but in the
reverse direction: $b_k \le b_{k+1} < a_{k+1} \le a_k$ for $k = 0, 1, 2, \ldots$. These facts can be
readily established from the code in Algorithm 40.

Table 3.1.1 Results of bisection for $f(x) = x e^x - 3$ starting with $[a, b] = [1, 2]$

  k   a_k          f(a_k)        b_k          f(b_k)
  0   1.0          -0.28171817   2.0          +11.7781121978
  1   1.0          -0.28171817   1.5          +3.7225336055
  2   1.0          -0.28171817   1.25         +1.3629286968
  3   1.0          -0.28171817   1.125        +0.4652439550
  4   1.0          -0.28171817   1.0625       +0.0744456906
  5   1.03125      -0.10778785   1.0625       +0.0744456906
  6   1.046875     -0.01773065   1.0625       +0.0744456906
  7   1.046875     -0.01773065   1.0546875    +0.0280898691
  8   1.046875     -0.01773065   1.05078125   +0.0051130416
  9   1.0488281    -0.00632540   1.05078125   +0.0051130416
 10   1.04980469   -0.00061033   1.05078125   +0.0051130416
Mathematical induction can then be used to show that, provided $a_0 < b_0$,

$$|b_k - a_k| = 2^{-k}\,|b_0 - a_0|, \quad\text{and}$$

$$a_0 \le a_k \le a_{k+1} < b_{k+1} \le b_k \le b_0$$

for $k = 0, 1, 2, \ldots$. From this, we can see that the sequence $a_k$ is a non-decreasing
sequence that is bounded above by $b_0$, and that $b_k$ is a non-increasing sequence that is
bounded below by $a_0$. Since bounded monotone sequences converge, $\lim_{k\to\infty} a_k = \hat{a}$
and $\lim_{k\to\infty} b_k = \hat{b}$ for some $\hat{a}$ and $\hat{b}$. On the other hand,

$$|\hat{b} - \hat{a}| = \lim_{k\to\infty} |b_k - a_k| = \lim_{k\to\infty} 2^{-k}\,|b_0 - a_0| = 0,$$

so $\hat{a} = \hat{b}$. Now $\operatorname{sign} f(a_k) = \operatorname{sign} f(a_0) = -\operatorname{sign} f(b_0) = -\operatorname{sign} f(b_k)$ for $k = 0,
1, 2, \ldots$. If $f(a_0) > 0$ then $f(b_0) < 0$ and so $f(a_k) > 0 > f(b_k)$ for all $k$. Taking
the limit as $k \to \infty$ we use continuity to see that $f(\hat{a}) \ge 0 \ge f(\hat{b}) = f(\hat{a})$ and so
$f(\hat{a}) = f(\hat{b}) = 0$ and $c = \hat{a} = \hat{b}$ is the solution we are looking for. We can treat the
case where $f(a_0) < 0 < f(b_0)$ in the same way, by reversing the directions of the
inequalities.


3.1.2 Robustness and reliability 


The convergence results of the previous section do not depend on the particular function
given to the bisect function. This makes the bisection method a very reliable one.
Convergence only requires that $f$ is continuous on $[a, b]$.

The bisection method is also robust to failures in our assumptions. For example,
even if $f$ is not continuous, the method can still converge, although not
necessarily to a solution. A specific case is $f(x) = \tan x$ on the interval $[1, 2]$.
Then the bisection method converges: $a_k, b_k \to \pi/2$ as $k \to \infty$. Now $\pi/2$ does
not satisfy $\tan(\pi/2) = 0$. The problem is that $\tan(\pi/2) = \sin(\pi/2)/\cos(\pi/2)$ and
$\cos(\pi/2) = 0$ so $\tan(\pi/2)$ is undefined. However, if $x \approx \pi/2$ and $x < \pi/2$ then
$\tan(x) > 0$; if $x \approx \pi/2$ and $x > \pi/2$ then $\tan(x) < 0$.
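The behavior for $f(x) = \tan x$ on $[1, 2]$ is easy to observe numerically. The following Python sketch uses a compact bisection routine of our own (it assumes the sign change at the endpoints has already been checked) and an arbitrarily chosen tolerance.

```python
import math

def bisect(f, a, b, eps):
    """Plain bisection; assumes f(a) and f(b) have opposite signs."""
    fa = f(a)
    while abs(b - a) > eps:
        m = (a + b) / 2
        fm = f(m)
        if fm == 0:
            return m
        if (fm > 0) == (fa > 0):
            a, fa = m, fm
        else:
            b = m
    return (a + b) / 2

x = bisect(math.tan, 1.0, 2.0, 1e-8)
# The limit is the sign change at pi/2, where |tan x| is huge rather than zero.
print(x, math.pi / 2, math.tan(x))
```

The returned value agrees with $\pi/2$ to within the tolerance, while the function value there is very large: the method has located the sign change, which is the graceful failure described above.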

The point is that even in the few extreme cases that the bisection method fails, it 
does so gracefully. There is one unavoidable limitation with the bisection method: it 
can only be applied to solving a single scalar equation in a single real variable. The 
fundamental reason is that there is no easy generalization of the intermediate value 
theorem to two or more dimensions. 

While the bisection method is extremely robust and reliable, the same cannot be 
said about most other algorithms for solving equations. 


Exercises. 


(1) Solve the equation x e* = 4 to an accuracy of 10~° using the bisection algorithm 
on f(x) = xe* —4. 

(2) Solve the equation e* = 2 + x? to an accuracy of 10~° by using the bisection 
algorithm on f(x) = e* — (2+ x*) with a = 0 and b =2. 

(3) Solve the equation of the previous question to the same accuracy by using the 

bisection algorithm on g(x) = x — In(2+ x’) and a = 0 and b =2. Compare 

the two computed solutions. Why are they the same? 

(4) The equation $\tan x = x$ has infinitely many solutions ($x = 0$ is one of them), but
your task is to find the one closest to $3\pi/2$. But beware! The solution is quite
close to $3\pi/2$, and $x = 3\pi/2$ is a singularity of $\tan x$. Report an estimate of the
solution with an error of no more than 10~°. 

(5) Solve the equation $f(x) = 0$ where $f(x) = x^5 - 10x^4 + 40x^3 - 80x^2 + 80x - 32$
with starting interval $[a, b] = [1.3, 3.2]$ to an accuracy of 10-}. Compare
this with solving $g(x) = 0$ where $g(x) = (x - 2)^5$. Explain the difference in the
results, given that $f(x)$ is simply $g(x)$ expanded symbolically.

(6) The stopping criterion in the bisection method is “$|b - a| < \epsilon$”. For most other
algorithms the stopping criterion is “$|f(x)| < \epsilon$”. Show that if $f$ is smooth and
$\epsilon$ is small, then these two stopping criteria are within a factor of approximately
$|f'(x^*)|$.

(7) Apply the bisection method to solve cos(1/x) = 0 on the interval (10-7, 1]. 
There are infinitely many solutions of cos(1/x) = 0 in the interval (0, 1). Which 
one is chosen? 

(8) Why can’t the bisection method be generalized to solving two equations in two
variables?


3.2 Fixed-point iteration


Fixed-point iteration is a simple algorithm:


for k = 0, 1, 2, ...
    x_{k+1} ← g(x_k)
end for


If we consider “$x_k$” to be not just a single number, but rather something more complex,
fixed-point iterations represent a very wide range of computational processes. For
example, if $x_k \in \mathbb{R}^n$, an $n$-dimensional real vector, then the iteration $x_{k+1} \leftarrow g(x_k)$
for $k = 0, 1, 2, 3, \ldots$ represents a large number of computational processes.

If $g$ is a continuous function $\mathbb{R}^n \to \mathbb{R}^n$ and the $x_k$’s converge to a limit $\hat{x}$, then
this limit is a fixed point of $g$:

$$\hat{x} = \lim_{k\to\infty} x_{k+1} = \lim_{k\to\infty} g(x_k) = g\bigl(\lim_{k\to\infty} x_k\bigr) = g(\hat{x}).$$

That is, $\hat{x}$ is a fixed point of $g$. There are well-known conditions under which
convergence is assured.


Theorem 3.3 (Contraction mapping theorem) Suppose $g\colon \mathbb{R}^n \to \mathbb{R}^n$ is a contraction
mapping (that is, there is a constant $L < 1$ where $\|g(u) - g(v)\| \le L\,\|u - v\|$
for all $u, v \in \mathbb{R}^n$). Then the iterates $x_k$ of $x_{k+1} \leftarrow g(x_k)$ for $k = 0, 1, 2, \ldots$ converge
to the unique fixed point $\hat{x}$: $g(\hat{x}) = \hat{x}$.


Contraction maps, then, are very useful for fixed-point iterations. But contraction
maps are hard to find.
Consider the problem of solving $x e^x = 3$. We can re-write this equation as

$$x = 3e^{-x}, \quad\text{or}\quad x = \ln(3/x) = \ln 3 - \ln x.$$

We let

$$g_1(x) = 3e^{-x}, \quad\text{and}\quad g_2(x) = \ln 3 - \ln x,$$

so we have the two corresponding fixed-point iterations

$$x^{(1)}_{k+1} \leftarrow g_1(x^{(1)}_k), \quad\text{and}\quad x^{(2)}_{k+1} \leftarrow g_2(x^{(2)}_k).$$

Table 3.2.1 Results for two fixed-point iterations

  k   x_k^(1)      f(x_k^(1))          x_k^(2)      f(x_k^(2))
  0   1.05029297   +2.250 × 10^{-3}    1.05029297   +2.250 × 10^{-3}
  1   1.04950572   -2.361 × 10^{-3}    1.04954314   -2.141 × 10^{-3}
  2   1.05033227   +2.480 × 10^{-3}    1.05025732   +2.041 × 10^{-3}
  3   1.04946449   -2.602 × 10^{-3}    1.04957709   -1.943 × 10^{-3}
  4   1.05037559   +2.735 × 10^{-3}    1.05022498   +1.852 × 10^{-3}
  5   1.04941902   -2.868 × 10^{-3}    1.04960788   -1.763 × 10^{-3}
  6   1.05042334   +3.014 × 10^{-3}    1.05019564   +1.680 × 10^{-3}
  7   1.04936891   -3.162 × 10^{-3}    1.04963581   -1.599 × 10^{-3}
  8   1.05047598   +3.323 × 10^{-3}    1.05016902   +1.524 × 10^{-3}
  9   1.04931368   -3.485 × 10^{-3}    1.04966116   -1.451 × 10^{-3}
 10   1.05053401   +3.663 × 10^{-3}    1.05014488   +1.382 × 10^{-3}


We will start with the solution estimate provided by the bisection method after 10
steps (1.05029297) and use $f(x) = x e^x - 3$ to estimate how close the solution
estimate is to the exact solution: if $x \approx x^*$ where $f(x^*) = 0$, then $f(x) \approx f(x^*) +
f'(x^*)(x - x^*) = f'(x^*)(x - x^*)$. Using $x^* \approx 1.05$ gives $f'(x^*) \approx 5.86$, so $f(x)$
is approximately 5.86 times the error, provided the error is small. The results are
shown in Table 3.2.1.


It appears from Table 3.2.1 that the error for $x^{(1)}_k$ is growing in size, but slowly,
while the error for $x^{(2)}_k$ is shrinking in size, but slowly. If this trend continues, then
$x^{(1)}_k$ will not converge to the solution $x^*$, while $x^{(2)}_k$ will. Why the difference? This
is the topic of the next section.
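The two iterations are easy to try out. The following Python sketch runs both from the bisection estimate 1.05029297; the function names `g1`, `g2`, `f` and the helper `iterate` are our own labels for the formulas $x = 3e^{-x}$, $x = \ln 3 - \ln x$ and $f(x) = x e^x - 3$ used above.

```python
import math

def g1(x):
    return 3.0 * math.exp(-x)

def g2(x):
    return math.log(3.0) - math.log(x)

def f(x):
    return x * math.exp(x) - 3.0

def iterate(g, x0, n):
    """Return the list [x_0, x_1, ..., x_n] of fixed-point iterates of g."""
    xs = [x0]
    for _ in range(n):
        xs.append(g(xs[-1]))
    return xs

xs1 = iterate(g1, 1.05029297, 10)
xs2 = iterate(g2, 1.05029297, 10)
for k, (u, v) in enumerate(zip(xs1, xs2)):
    print(f"{k:2d}  {u:.8f}  {f(u):+.3e}  {v:.8f}  {f(v):+.3e}")
```

Printing $f$ at each iterate, as in Table 3.2.1, gives a cheap proxy for the error since $f(x) \approx 5.86\,(x - x^*)$ near the solution.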


3.2.1 Convergence 


If we consider a scalar fixed-point iteration

$$x_{k+1} \leftarrow g(x_k) \quad\text{for } k = 0, 1, 2, \ldots,$$

we suppose that $x_k \to x^*$ as $k \to \infty$ where $g(x^*) = x^*$ is a fixed point. From the
mean value theorem, there must be a $c_k$ between $x^*$ and $x_k$ where $g(x_k) - g(x^*) =
g'(c_k)(x_k - x^*)$. So

$$x_{k+1} - x^* = g(x_k) - x^* = g(x_k) - g(x^*) = g'(c_k)(x_k - x^*).$$

Writing $e_k = x_k - x^*$, which is the error in $x_k$, we see that

$$e_{k+1} = g'(c_k)\, e_k.$$

If $x_k \to x^*$ as $k \to \infty$, then by the squeeze theorem, $c_k \to x^*$ as well. Provided $g'$
is continuous, we see that

$$\lim_{k\to\infty} \frac{e_{k+1}}{e_k} = \lim_{k\to\infty} g'(c_k) = g'(x^*).$$

Theorem 3.4 (Convergence of fixed-point iterations) Consider the fixed-point iteration
$x_{k+1} \leftarrow g(x_k)$ for $k = 0, 1, 2, 3, \ldots$ where $g$ is differentiable and $g(x^*) = x^*$.
Then if $|g'(x^*)| < 1$ there is a $\delta > 0$ where $x_k \to x^*$ as $k \to \infty$ provided $|x_0 - x^*| <
\delta$; if $|g'(x^*)| > 1$ then either $x_k = x^*$ exactly for some $k$ or $x_k \not\to x^*$.


Before we start with the proof, it should be noted that the existence of a $\delta > 0$ where
convergence is guaranteed for all $x_0$ within $\delta$ of $x^*$ can be colloquially described as
“convergence is guaranteed for all $x_0$ sufficiently close to $x^*$”. How close “sufficiently
close” is, of course, depends on the iteration function $g$.

In the alternative case where $|g'(x^*)| > 1$, there is the possibility that at some point
in the iteration, we have $x_k = x^*$ exactly. This is wildly improbable, but it is possible
to contrive instances where it occurs. For example, if $g(x) = x^2$ then $x_0 = -1$ will
give $x_1 = +1$, which is a fixed point of $g$ with $g'(1) = 2$. Any starting point other than
$x_0 = -1$ or $x_0 = x^* = 1$ will result in either divergence of the sequence $x_k$ from the
fixed-point iteration or convergence to the other fixed point (which is zero).


Proof (Of Theorem 3.4.) Suppose that $|g'(x^*)| < 1$. Let $\epsilon = \frac{1}{2}(1 - |g'(x^*)|) > 0$.
Then by continuity of $g'$, there is a $\delta > 0$ where $|x - x^*| \le \delta$ implies $|g'(x) -
g'(x^*)| \le \epsilon$. Let $L = \frac{1}{2}(1 + |g'(x^*)|)$ which is less than one. Thus if $|x - x^*| \le \delta$,

$$|g'(x)| \le |g'(x^*)| + \epsilon = |g'(x^*)| + \tfrac{1}{2}\bigl(1 - |g'(x^*)|\bigr) = L < 1.$$

So if $|x - x^*| \le \delta$,

$$\begin{aligned}
|g(x) - x^*| &= |g(x) - g(x^*)| = |g'(c)(x - x^*)| \quad\text{for some $c$ between $x$ and $x^*$} \\
 &= |g'(c)|\,|x - x^*| \le L\,|x - x^*| \le |x - x^*| \le \delta.
\end{aligned}$$

Therefore, $g$ maps the interval $[x^* - \delta, x^* + \delta]$ into itself. In particular, since
$|x_0 - x^*| \le \delta$, we have $|x_k - x^*| \le \delta$ for $k = 0, 1, 2, 3, \ldots$. Furthermore, for any
$x^* - \delta \le x_k \le x^* + \delta$,

$$|x_{k+1} - x^*| = |g(x_k) - g(x^*)| = |g'(c)(x_k - x^*)| = |g'(c)|\,|x_k - x^*|$$

for some $c$ between $x_k$ and $x^*$; therefore, $c$ is in $[x^* - \delta, x^* + \delta]$ and so $|g'(c)| \le L$.
Thus $|x_{k+1} - x^*| \le L\,|x_k - x^*|$ for $k = 0, 1, 2, \ldots$. Then $|x_k - x^*| \le L^k\,|x_0 - x^*| \to 0$
as $k \to \infty$. That is, $x_k \to x^*$ as $k \to \infty$.

In the alternative case where $|g'(x^*)| > 1$, suppose that $x_k \ne x^*$ for any $k$. Then
supposing that $x_k \to x^*$ as $k \to \infty$ would imply that

$$\lim_{k\to\infty} \frac{|x_{k+1} - x^*|}{|x_k - x^*|} = |g'(x^*)| > 1$$

using the arguments immediately preceding the statement of the theorem. That is,
there is a $K$ where $k \ge K$ implies that

$$\frac{|x_{k+1} - x^*|}{|x_k - x^*|} > 1.$$

Therefore, $|x_k - x^*| > |x_K - x^*| > 0$ for all $k > K$, which contradicts the assumption
that $x_k \to x^*$ as $k \to \infty$. Thus, the only way in which $x_k \to x^*$ as $k \to \infty$ in
this alternative case is if $x_\ell = x^*$ exactly for some $\ell$, and thus $x_j = x^*$ for all $j \ge \ell$.


Example 3.5 As a practical application of this result, we consider the iterations
$x^{(1)}_{k+1} \leftarrow g_1(x^{(1)}_k)$ and $x^{(2)}_{k+1} \leftarrow g_2(x^{(2)}_k)$ given in the previous section. The fixed point
of interest is $x^* \approx 1.05$, so convergence (or divergence) is determined by $|g'(x^*)|$.
For the first iteration, $g_1'(x) = -3e^{-x}$ and $3e^{-x^*} = x^*$, so $|g_1'(x^*)| \approx |g_1'(1.05)| \approx
|-1.05| = 1.05 > 1$, while $g_2'(x) = -1/x$ gives $|g_2'(x^*)| \approx |g_2'(1.05)| = |-1/1.05| \approx
0.95 < 1$. Thus, the first iteration roughly increases the error by 5%, while the second
iteration roughly decreases the error by 5%, with each iteration.


Note that smaller values of $|g'(x^*)|$ result in faster asymptotic convergence. The
optimal case of $|g'(x^*)| = 0$ is actually of more than theoretical importance as we
will discuss later in connection with Newton’s method.


3.2.2 Robustness and reliability 


We have seen two iteration functions for solving the same problem: one gave convergence,
the other divergence. Clearly more needs to be understood about the iteration
for it to converge. Furthermore, even if the convergence condition $|g'(x^*)| < 1$ holds,
convergence is only guaranteed if “$x_0$ is sufficiently close to $x^*$”. That is, convergence
is only known to hold locally, around the exact solution.

Worse than this, general fixed-point iterations can result in extremely bad behavior.
Take, for example, $g(x) = x^2$. The fixed points are $x = 0$ and $x = 1$. But if we start at
$x_0 = 2$ we get $x_1 = 2^2$, $x_2 = (2^2)^2 = 2^4$, $x_3 = (2^4)^2 = 2^8$, and, in general, $x_k = 2^{2^k}$,
which grows very rapidly in size.

In another example, the iteration $x_{k+1} = 4x_k(1 - x_k)$ is known to have all iterates
$x_k \in [0, 1]$ if $x_0 \in [0, 1]$, but the behavior of these iterates can be extremely hard
to predict. In fact, this is a standard example in nonlinear dynamics of a “chaotic”
system. To see better how it behaves, write $x_k = \sin^2 \theta_k$. Then

$$\begin{aligned}
x_{k+1} &= 4x_k(1 - x_k) = 4 \sin^2\theta_k\,(1 - \sin^2\theta_k) \\
 &= 4 \sin^2\theta_k \cos^2\theta_k = (2 \sin\theta_k \cos\theta_k)^2 = \sin^2(2\theta_k).
\end{aligned}$$

Writing $\theta_{k+1} = 2\theta_k \bmod \pi$ gives an exact solution which, nevertheless, exposes
the “chaotic” nature of its dynamics: The behavior of the iterates depends on the
binary expansion of $\theta_0/2\pi$; each iteration shifts this binary expansion one place left
and zeros out the bit(s) before the “binary point”. The two fixed points are $x = 0$ and
$x = 3/4$; in terms of $\theta$ they are $\theta = 0$ and $\theta = 2\pi/3$. Both fixed points are unstable.
Typical behavior of the iterates $x_k$ is aperiodic and does not converge to a fixed point
or a periodic orbit.
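Both the closed form $x_k = \sin^2(2^k \theta_0)$ and the sensitivity to the starting value can be observed directly. The Python sketch below is an illustration only: the starting value $x_0 = 0.3$, the perturbation $10^{-10}$ and the number of iterations are arbitrary choices of ours.

```python
import math

def logistic_orbit(x0, n):
    """Return [x_0, ..., x_n] for the iteration x_{k+1} = 4 x_k (1 - x_k)."""
    xs = [x0]
    for _ in range(n):
        xs.append(4.0 * xs[-1] * (1.0 - xs[-1]))
    return xs

x0 = 0.3
theta0 = math.asin(math.sqrt(x0))   # chosen so that x0 = sin(theta0)**2
xs = logistic_orbit(x0, 60)

# Closed form x_k = sin^2(2^k theta_0), compared for the first few iterates.
for k in range(11):
    closed = math.sin(2 ** k * theta0) ** 2
    print(k, xs[k], closed)

# Sensitivity: perturb x0 slightly and record the gap between the two orbits.
ys = logistic_orbit(x0 + 1e-10, 60)
gaps = [abs(x - y) for x, y in zip(xs, ys)]
print(max(gaps))
```

The gap between the two orbits roughly doubles per iteration on average, so a perturbation of size $10^{-10}$ becomes visible after a few dozen steps.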

The range of possible behavior of the iterates of fixed-point iterations is enormous, 
especially for multivariate fixed-point iterations, which are covered in the following 
section. We will see later that this family of methods can be extremely fast as well 
as frustratingly slow. 


3.2.3. Multivariate fixed-point iterations 


Consider iterations of the form

$$x_{k+1} \leftarrow g(x_k) \quad\text{for } k = 0, 1, 2, \ldots \tag{3.2.1}$$

with $x_k \in \mathbb{R}^n$. We again assume that $g$ is differentiable and that there is a designated
exact fixed point $x^* = g(x^*)$. We use a multivariate version of the mean value
theorem:

$$g(v) - g(u) = \int_0^1 \nabla g(u + s(v - u))\,(v - u)\,ds. \tag{3.2.2}$$

We assume that we are using compatible matrix and vector norms, both represented
by $\|\cdot\|$.


Theorem 3.6 If $\nabla g$ is continuous and $\|\nabla g(x^*)\| < 1$ then there is a $\delta > 0$ where
$\|x_0 - x^*\| < \delta$ implies the iterates $x_k$ of (3.2.1) converge to $x^*$.


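Proof Let $\epsilon = (1 - \|\nabla g(x^*)\|)/2$. Let $L = (1 + \|\nabla g(x^*)\|)/2 < 1$. By continuity
of $\nabla g$, there is a $\delta > 0$ where $\|x - x^*\| \le \delta$ implies $\|\nabla g(x) - \nabla g(x^*)\| \le \epsilon$. We
define the closed ball $B(x^*, \delta) = \{x \mid \|x - x^*\| \le \delta\}$. Then if $x \in B(x^*, \delta)$ we have
$\|\nabla g(x)\| \le \|\nabla g(x^*)\| + \epsilon = L < 1$ and, since $x + s(x^* - x) \in B(x^*, \delta)$ for $0 \le s \le 1$,

$$\begin{aligned}
\|g(x) - x^*\| = \|g(x) - g(x^*)\| &= \left\| \int_0^1 \nabla g(x + s(x^* - x))\,(x - x^*)\,ds \right\| \\
 &\le \int_0^1 \|\nabla g(x + s(x^* - x))\,(x^* - x)\|\,ds \\
 &\le \int_0^1 \|\nabla g(x + s(x^* - x))\|\,\|x^* - x\|\,ds \\
 &\le \int_0^1 L\,\|x^* - x\|\,ds = L\,\|x^* - x\| \le L\delta < \delta,
\end{aligned}$$

and so $g(x) \in B(x^*, \delta)$. First, we see that $g$ maps $B(x^*, \delta) \to B(x^*, \delta)$. Next we see
that if $x_k \in B(x^*, \delta)$ then $\|x_{k+1} - x^*\| \le L\,\|x_k - x^*\|$. Therefore, provided $x_0 \in
B(x^*, \delta)$, $\|x_k - x^*\| \le L^k\,\|x_0 - x^*\| \to 0$ as $k \to \infty$, and so $x_k \to x^*$ as $k \to \infty$.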


The alternative case that $\|\nabla g(x^*)\| > 1$ does not imply a lack of convergence. If
$g$ is linear, then $\nabla g$ is constant, and the question of convergence of the fixed-point
iteration amounts to asking if $A^k \to 0$ as $k \to \infty$ where $A = \nabla g(x^*)$. We know from
the properties of matrix norms that $\|A^k\| \le \|A\|^k$, but there is no reverse inequality.
In fact, we can have $A^2 = 0$ with $\|A\| > 1$, such as

$$A = \begin{bmatrix} 0 & 2 \\ 0 & 0 \end{bmatrix}, \quad\text{for which } \|A\|_\infty = \|A\|_2 = 2.$$

The question of which matrices $A$ satisfy $A^k \to 0$ as $k \to \infty$ is answered by the
spectral radius (2.4.5):

$$\rho(A) = \max\{\,|\lambda| : \lambda \text{ is an eigenvalue of } A\,\}.$$

For a square matrix $A$, Theorem 2.15 shows that

$$A^k \to 0 \text{ as } k \to \infty \quad\text{if and only if}\quad \rho(A) < 1.$$

It should be noted that for any induced matrix norm, $\|A\| \ge \rho(A)$. To see why
this is so, suppose $v$ is an eigenvector of $A$ with eigenvalue $\lambda$: $Av = \lambda v$ with $v \ne 0$.
Then $\|A\|\,\|v\| \ge \|Av\| = \|\lambda v\| = |\lambda|\,\|v\|$ and dividing by $\|v\| > 0$ gives $\|A\| \ge |\lambda|$
for every eigenvalue $\lambda$. Taking the maximum over all eigenvalues gives $\|A\| \ge \rho(A)$
as desired.

Using Theorem 2.15 with Theorem 3.6 we can show convergence for a fixed-point
iteration provided $\rho(\nabla g(x^*)) < 1$.
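The gap between $\|A\|$ and $\rho(A)$ can be seen on small examples. The Python sketch below uses plain lists so that it has no dependencies; the matrix `B` is a hypothetical example of ours (upper triangular, so its eigenvalues are its diagonal entries 0.5 and 0.5), and the helpers `matmul`, `inf_norm` and `power` are illustrative names.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inf_norm(A):
    """Induced infinity norm: the largest absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

def power(A, k):
    """Return A^k by repeated multiplication."""
    n = len(A)
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        P = matmul(P, A)
    return P

A = [[0.0, 2.0], [0.0, 0.0]]   # norm 2, but A^2 = 0
B = [[0.5, 2.0], [0.0, 0.5]]   # hypothetical: norm 2.5, spectral radius 0.5

print(inf_norm(A), power(A, 2))
for k in (1, 2, 5, 10, 20, 40):
    print(k, inf_norm(power(B, k)))
```

For `B` the norms of the powers stay above one for a few steps before the factor $0.5^k$ takes over, which is the behavior that the spectral radius predicts and the norm alone does not.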


Example 3.7 Here is an example of a fixed-point iteration to solve

$$\begin{aligned}
x^3 - x^2 + 2y + 3 &= 0, \\
e^y + xy + x - x^2 - 10 &= 0.
\end{aligned}$$

The iteration function is

$$g\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = \begin{bmatrix} (x^2 - 2y - 3)^{1/3} \\ \ln(10 + x^2 - x - xy) \end{bmatrix}.$$

With starting value $x_0 = [x_0, y_0]^T = [0, 1]^T$, the values of $\|x_k - g(x_k)\|_2$ are shown
in Figure 3.2.1. Geometric convergence is apparent from the figure. Furthermore, we
can compute

$$\nabla g(x^*) \approx \begin{bmatrix} -0.370825 & -0.206379 \\ -0.372222 & +0.087879 \end{bmatrix},$$

$$\|\nabla g(x^*)\|_2 \approx 0.5713577, \quad\text{while}\quad \rho(\nabla g(x^*)) \approx 0.5013064.$$

Estimating the slope in Figure 3.2.1 gives a reduction by a factor of $\approx 0.49686$ per
iteration, which is quite close to the spectral radius $\rho(\nabla g(x^*)) \approx 0.5013064$.

Fig. 3.2.1 Convergence of multivariate fixed-point iteration example ($\|x_k - g(x_k)\|$ plotted against $k$ for $0 \le k \le 20$)
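The iteration of Example 3.7 can be run in a few lines. The Python sketch below is a minimal illustration: the helper names `cbrt`, `g` and `residual` are ours, the real cube root is taken so that negative arguments are handled, and 40 iterations is an arbitrary choice.

```python
import math

def cbrt(t):
    """Real cube root, valid for negative t as well."""
    return math.copysign(abs(t) ** (1.0 / 3.0), t)

def g(x, y):
    """Iteration function of Example 3.7."""
    return cbrt(x * x - 2.0 * y - 3.0), math.log(10.0 + x * x - x - x * y)

def residual(x, y):
    """Euclidean norm of x_k - g(x_k)."""
    gx, gy = g(x, y)
    return math.hypot(x - gx, y - gy)

x, y = 0.0, 1.0
res = [residual(x, y)]
for k in range(40):
    x, y = g(x, y)
    res.append(residual(x, y))

print(x, y)
# Ratios of successive residuals estimate the per-iteration reduction factor.
print([res[k + 1] / res[k] for k in range(10, 15)])
```

The printed ratios can be compared with the spectral radius quoted in the example.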


Exercises. 


(1) Consider the equation $e^x + x^2 = 4$. We look for positive solutions. Show, using
the fact that $(d/dx)(e^x + x^2) > 0$ for $x > 0$, that there is only one positive
solution. Consider the two iteration schemes

$$x_{n+1} = \sqrt{4 - e^{x_n}},$$

$$x_{n+1} = \ln(4 - x_n^2).$$

Which of these methods converges near to the solution? What is the rate of
convergence?

(2) The contraction mapping theorem (Theorem 3.3) implies that if $f\colon \mathbb{R} \to \mathbb{R}$ is
differentiable and $|f'(x)| \le L$ for all $x$, with $L < 1$, then $f$ has a unique fixed
point. However, show that the function $f(x) = x + e^{-x}$ maps $[0, \infty) \to [0, \infty)$,
has $f'(x) < 1$ for all $x$, and yet has no fixed point. Explain the paradox.

(3) Suppose that $f\colon \mathbb{R}^n \to \mathbb{R}^n$ has the property that $\nabla f(x)$ is positive definite
(but not necessarily symmetric!) with $\lambda_{\min}(\nabla f(x) + \nabla f(x)^T) \ge \alpha$ for
all $x$ with $\alpha > 0$, and bounded ($\|\nabla f(x)\|_2 \le B$ for all $x \in \mathbb{R}^n$). Show
that $g(x) = x - (\alpha/(2B^2)) f(x)$ is a contraction mapping. [Hint: Show that
$\|\nabla g(x)\|_2^2 \le 1 - (\alpha/(2B))^2$. You can use $\|\nabla g(x) z\|_2^2 = z^T \nabla g(x)^T \nabla g(x) z$
and expand $\nabla g(x) = I - (\alpha/(2B^2)) \nabla f(x)$.]

(4) Fixed-point iterations can be very slowly convergent if $|g'(x^*)| = 1$. Take, for
example, $x_{n+1} = \sin x_n$. Starting from $x_0 = 1$, note that all $x_n > 0$. Plot $x_n$
against $n$ on a log-log plot. Estimate $C$ and $\alpha$ where $x_n \sim C n^{-\alpha}$ as $n \to \infty$.
Use the fact that $\sin x \approx x - x^3/6$ to explain these values of $C$ and $\alpha$. [Hint:
Match $C(n + 1)^{-\alpha} \approx C n^{-\alpha} - (C n^{-\alpha})^3/6$ as $n \to \infty$.]

(5) Solve the simultaneous equations below using the obvious fixed-point iteration.
Perform 20 iterations of fixed-point iteration.


Ind + x7 + y*) — y/24+1, 
exp)? p= 


Xx 


y 


What is the solution x* you obtain? What is the Jacobian matrix for the iter- 
ation function g at x*? What convergence rate is predicted by the Jacobian 
matrix? What is the convergence rate you obtain empirically? Are the two rates 
consistent? 

(6) A method of accelerating a fixed-point iteration $x_{n+1} = g(x_n)$ is to assume that
$x_n \approx x^* + a r^n$ for large $n$. Using $x_k \approx x^* + a r^k$ for $k = n-1, n, n+1$ solve
the approximate equations for $a$, $r$, and $x^*$ from $x_{n-1}$, $x_n$, and $x_{n+1}$. The value
of $x^*$ obtained from this is not expected to be the exact solution, but rather a
better approximation to it. [Hint: Use $(x_{n+1} - x_n)/(x_n - x_{n-1})$ to estimate $r$.]

(7) Apply the acceleration method of Exercise 6 to the convergent iteration in
Exercise 1. How quickly do the accelerated iterates converge?

(8) A vector version of the acceleration method of Exercise 6 is to suppose
that $x_n \approx x^* + a r^n$. Estimate $r$ using $r \approx (x_{n+1} - x_n)^T (x_n - x_{n-1}) / (x_n -
x_{n-1})^T (x_n - x_{n-1})$. Develop formulas for the estimates of $a$ and $x^*$. Use it
to accelerate the fixed-point iteration in Exercise 5. What rate of convergence do the
accelerated estimates of $x^*$ have?

(9) Another idea of accelerating fixed-point iterations is to suppose that if $x^*$ is
the exact fixed point, then $x_{n+1} - x^* = g(x_n) - g(x^*) \approx A(x_n - x^*)$. If $A$
has been estimated, then show that $x_{n+1} - A x_n \approx (I - A) x^*$ and so a better
estimate of $x^*$ can be computed.

(10) If

$$A = \begin{bmatrix} 0.9 & 2 & & & \\ & 0.9 & 2 & & \\ & & \ddots & \ddots & \\ & & & 0.9 & 2 \\ & & & & 0.9 \end{bmatrix} \quad (20 \times 20),$$

compute $\|A^n\|_2$ for $n = 1, 2, \ldots, 50$, and plot these values against $n$. Check that
$A^n \to 0$ exponentially fast as $n \to \infty$. Explain why we seem to need large $n$
for this to become apparent.


3.3 Newton’s method 


Newton’s method is a method for solving equations that is sometimes taught as part 
of an introductory calculus course. Nevertheless, it is a powerful and general tool 
that finds many applications. 

A way of deriving Newton’s method in one variable is to use the tangent line 
through a point on the graph to obtain a new estimate of the solution x* of f(x) = 0, 
as illustrated in Figure 3.3.1. 

In Figure 3.3.1, the initial guess $x_0$ is used to create the tangent line which crosses
the $x$-axis at $x_1$. Then the tangent line at $x_1$ gives the next guess $x_2$, which in turn
leads to the next guess $x_3$, which is already much closer to the solution $x^*$ where
$f(x^*) = 0$.

Fig. 3.3.1 Newton’s method

To derive the method in a more formal way, we note that if $x \approx x_0$ then
$f(x) \approx f(x_0) + f'(x_0)(x - x_0)$. Instead of trying to directly solve $f(x) = 0$, we
solve $f(x_0) + f'(x_0)(x - x_0) = 0$, and we obtain

$$x = x_0 - \frac{f(x_0)}{f'(x_0)}.$$

Using this as the next guess $x_1$, and repeating the process gives Algorithm 41.
Newton’s method is a particular kind of fixed-point iteration $x_{k+1} \leftarrow g(x_k)$ where

$$g(x) = x - f(x)/f'(x).$$

Algorithm 41 Newton’s method

 1 function newton(f, f′, x_0, ε)
 2   k ← 0
 3   while |f(x_k)| > ε
 4     x_{k+1} ← x_k − f(x_k)/f′(x_k)
 5     k ← k + 1
 6   end while
 7   return x_k
 8 end function


3.3.1 Convergence of Newton’s method 


To get an empirical example of how Newton’s method converges, consider the
problem of solving $e^x - x^2 - 3 = 0$. We apply Newton’s method with initial guess
$x_0 = 1$. The results are shown in Table 3.3.1.

Table 3.3.1 shows the typical rapid convergence of Newton’s method, once the
iterates start approaching the true solution. The final entry shows $f(x_7)$ is a small
integer multiple of unit roundoff. This means that the actual value of $f(x_7)$ is very close
to zero. Also note that the exponents of the values $f(x_k)$ roughly double with each
iteration for $k \ge 3$. This is indicative of quadratic convergence: $|e_{k+1}| \le C\,|e_k|^2$.
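The experiment reported in Table 3.3.1 can be reproduced with a direct transcription of Algorithm 41. The Python sketch below is a minimal version; the iteration cap `max_iter` and the tolerance $10^{-12}$ are our own additions for illustration and are not part of the algorithm as stated.

```python
import math

def newton(f, df, x0, eps, max_iter=50):
    """Newton's method as in Algorithm 41, with an iteration cap as a safeguard."""
    x, k = x0, 0
    while abs(f(x)) > eps and k < max_iter:
        x = x - f(x) / df(x)
        k += 1
    return x, k

def f(x):
    return math.exp(x) - x * x - 3.0

def df(x):
    return math.exp(x) - 2.0 * x

root, iters = newton(f, df, 1.0, 1e-12)
print(iters, root, f(root))
```

Printing $f(x_k)$ inside the loop instead of only at the end shows the exponents roughly doubling once the iterates are close to the solution.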


Table 3.3.1 Newton’s method applied to $e^x - x^2 - 3 = 0$

  k   x_k                  f(x_k)
  0   1.000000000000000    -1.2817 × 10^{0}
  1   2.784422382354665    +5.4375 × 10^{0}
  2   2.272498962410127    +1.5394 × 10^{0}
  3   1.974092118993297    +3.0304 × 10^{-1}
  4   1.880903342392723    +2.1630 × 10^{-2}
  5   1.873171697204092    +1.3577 × 10^{-4}
  6   1.873122549684362    +5.4455 × 10^{-9}
  7   1.873122547713043    -4.4409 × 10^{-16}

Theorem 3.8 (Convergence of Newton’s method) Suppose that $f$ has a continuous
second derivative, $f(x^*) = 0$, and $f'(x^*) \ne 0$. Then there is a $\delta > 0$ where
$|x_0 - x^*| < \delta$ implies that $x_k \to x^*$ and $(x_{k+1} - x^*)/(x_k - x^*)^2$ converges to
$f''(x^*)/(2 f'(x^*))$ as $k \to \infty$.

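Proof Newton’s method is

$$x_{k+1} = x_k - f(x_k)/f'(x_k). \tag{3.3.1}$$

Using Taylor series with remainder, and $e_k = x_k - x^*$, we see that

$$\begin{aligned}
f(x_k) &= f(x^*) + f'(x^*)\,e_k + \tfrac{1}{2} f''(c_k)\,e_k^2, \\
f'(x_k) &= f'(x^*) + f''(d_k)\,e_k,
\end{aligned}$$

where $c_k$ and $d_k$ are between $x_k$ and $x^*$. Subtracting $x^*$ from both sides of (3.3.1)
and using $f(x^*) = 0$ gives

$$\begin{aligned}
e_{k+1} &= e_k - \frac{f(x_k)}{f'(x_k)} \\
 &= e_k - \frac{f(x^*) + f'(x^*)\,e_k + \frac{1}{2} f''(c_k)\,e_k^2}{f'(x^*) + f''(d_k)\,e_k} \\
 &= e_k - \frac{\bigl(f'(x^*) + \frac{1}{2} f''(c_k)\,e_k\bigr)\,e_k}{f'(x^*) + f''(d_k)\,e_k} \\
 &= e_k \left[ 1 - \frac{f'(x^*) + \frac{1}{2} f''(c_k)\,e_k}{f'(x^*) + f''(d_k)\,e_k} \right] \\
 &= e_k\,\frac{\bigl(f''(d_k) - \frac{1}{2} f''(c_k)\bigr)\,e_k}{f'(x^*) + f''(d_k)\,e_k} \\
 &= e_k^2\,\frac{f''(d_k) - \frac{1}{2} f''(c_k)}{f'(x^*) + f''(d_k)\,e_k}.
\end{aligned}$$

Choose $C > |f''(x^*)/(2 f'(x^*))|$. By continuity of $f''$, there is a $\delta_1 > 0$ such that

$$\left| \frac{f''(d) - \frac{1}{2} f''(c)}{f'(x^*) + f''(d)\,e} \right| \le C \quad\text{for all } |c - x^*|,\ |d - x^*|,\ |e| \le \delta_1.$$

Then provided $|e_k| \le \delta_1$ we have $|e_{k+1}| \le C\,|e_k|^2 = (C\,|e_k|)\,|e_k|$. If we also have
$C\,|e_k| \le \frac{1}{2}$ then $|e_{k+1}| \le \frac{1}{2}|e_k|$. Let $\delta_2 = 1/(2C) > 0$. Then provided $|e_k| \le \min(\delta_1, \delta_2)$,
we have $|e_{k+1}| \le C\,|e_k|^2 \le \frac{1}{2}|e_k|$. This in turn implies that $|e_{k+1}| \le |e_k| \le
\min(\delta_1, \delta_2)$; from this we can see that $e_k \to 0$ as $k \to \infty$ provided $|e_0| \le \min(\delta_1, \delta_2)$.

Now that convergence has been clearly established,

$$\frac{e_{k+1}}{e_k^2} = \frac{f''(d_k) - \frac{1}{2} f''(c_k)}{f'(x^*) + f''(d_k)\,e_k} \to \frac{\frac{1}{2} f''(x^*)}{f'(x^*)} \quad\text{as } k \to \infty,$$

since $c_k, d_k \to x^*$ and $e_k \to 0$ as $k \to \infty$.


To give a better idea of just how fast this rate of convergence is, suppose that $|e_{k+1}| \le
C\,|e_k|^2$. Multiplying by $C$ gives $C\,|e_{k+1}| \le (C\,|e_k|)^2$. If $C\,|e_0| \le 1/2$, then Table 3.3.2
shows how quickly the resulting bound on $C\,|e_k|$ goes to zero.

Table 3.3.2 Quadratic convergence

  k        0              1              2              3              4
  C|e_k|   5 × 10^{-1}    2.5 × 10^{-1}  6.25 × 10^{-2} 3.91 × 10^{-3} 1.53 × 10^{-5}

  k        5              6              7              8              9
  C|e_k|   2.33 × 10^{-10} 5.42 × 10^{-20} 2.94 × 10^{-39} 8.64 × 10^{-78} 7.46 × 10^{-155}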
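The bound $C\,|e_{k+1}| \le (C\,|e_k|)^2$ amounts to repeated squaring, so the entries of Table 3.3.2 can be regenerated in a few lines. The Python sketch below assumes the worst case $C\,|e_0| = 1/2$ with equality at every step; the variable names are our own.

```python
bound = 0.5                  # assumes C*|e_0| = 1/2
bounds = [bound]
for k in range(9):
    bound = bound ** 2       # C|e_{k+1}| <= (C|e_k|)^2
    bounds.append(bound)

for k, b in enumerate(bounds):
    print(k, f"{b:.2e}")
```

Each squaring doubles the number of correct digits, which is why nine steps take the bound from $0.5$ down to roughly $10^{-154}$.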


3.3.2 Reliability of Newton’s method 


A major problem is getting a starting guess $x_0$ that is “close enough”. How close
is “close enough” depends very much on the problem. And, yes, Newton’s method
can fail to converge. As a simple example, consider the equation $\tan^{-1} x = 0$. The
solution is easy: $x = \tan 0 = 0$. But let’s try using Newton’s method:

$$x_{n+1} = x_n - \frac{\tan^{-1} x_n}{1/(1 + x_n^2)} = x_n - (1 + x_n^2)\,\tan^{-1} x_n.$$

Convergence, or failure to converge, depends on the starting point. Table 3.3.3 shows
what happens with two different starting points.

Table 3.3.3 Newton’s method applied to $\tan^{-1} x = 0$

  k      0   1                 2                 3                 4
  x_k    1   -5.707 × 10^{-1}  1.169 × 10^{-1}   -1.061 × 10^{-3}  7.963 × 10^{-10}
  x_k    2   -3.536 × 10^{0}   1.395 × 10^{+1}   -2.793 × 10^{+2}  1.220 × 10^{+5}

The precise point where convergence gives way to divergence can be precisely
computed for this problem, although this is not the case with most problems.
Another case where convergence may be (somewhat) in doubt is the case where
$f'(x^*) = 0$. This is a rarer case, since it requires exact equality, but cases where $f'(x^*)$
is small may cause numerical difficulties even if asymptotic quadratic convergence is
assured. To understand the typical situation where $f'(x^*) = 0$, assume that $f(x) =
(x - x^*)^p\,g(x)$ where $p > 1$ and $g(x^*) \ne 0$. Then $f'(x) = p(x - x^*)^{p-1} g(x) +
(x - x^*)^p g'(x)$. Newton’s method then becomes

$$\begin{aligned}
x_{n+1} &= x_n - \frac{f(x_n)}{f'(x_n)} \\
 &= x_n - \frac{(x_n - x^*)^p\,g(x_n)}{p(x_n - x^*)^{p-1} g(x_n) + (x_n - x^*)^p g'(x_n)} \\
 &= x_n - \frac{(x_n - x^*)\,g(x_n)}{p\,g(x_n) + (x_n - x^*)\,g'(x_n)}.
\end{aligned}$$

Subtracting $x^*$ from both sides, and writing $e_n = x_n - x^*$,

$$e_{n+1} = e_n - \frac{e_n\,g(x_n)}{p\,g(x_n) + e_n\,g'(x_n)} = e_n \left[ 1 - \frac{1}{p + e_n\,(g'(x_n)/g(x_n))} \right],$$

so $e_{n+1}/e_n \to 1 - (1/p) \ne 0$ if $e_n \to 0$ as $n \to \infty$. That is, Newton’s method in
these circumstances shows linear convergence, not quadratic convergence. Nevertheless,
it still converges provided $x_0$ is, again, “sufficiently close” to $x^*$.
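The limiting ratio $1 - 1/p$ is easy to observe. The Python sketch below uses a hypothetical test function of our own, $f(x) = (x - 1)^2 (x + 2)$, which has a double root ($p = 2$) at $x^* = 1$ with $g(x) = x + 2$; the starting point and iteration count are arbitrary.

```python
def f(x):
    return (x - 1.0) ** 2 * (x + 2.0)      # double root at x* = 1, so p = 2

def df(x):
    return 2.0 * (x - 1.0) * (x + 2.0) + (x - 1.0) ** 2

x = 2.0
errors = [x - 1.0]
for n in range(20):
    x = x - f(x) / df(x)                   # plain Newton step
    errors.append(x - 1.0)

ratios = [errors[n + 1] / errors[n] for n in range(20)]
print(ratios[0], ratios[-1])
```

The error ratios settle near $1/2 = 1 - 1/p$, so each step gains only about one binary digit instead of doubling the number of correct digits.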

The main reliability issue is then about starting points xo not being “sufficiently 
close”. Starting far from the solution x* (assuming there is only one) means that 
the derivative f’(xo) does not necessarily give useful information about the correct 
direction to move in. There are going to be limits on what algorithms can achieve. 
After all, not all equations have solutions. In the next section, we look at one way of 
improving the reliability of Newton’s method. 


3.3.3 Variant: Guarded Newton method 


The failure of the Newton method in the case of solving $\tan^{-1} x = 0$ with $x_0 = 2$ can
be traced to the fact that the linear approximation $f(x) \approx f(x_0) + f'(x_0)(x - x_0)$
is not a good one if $x - x_0$ is large, and $x_1 - x_0 = -f(x_0)/f'(x_0)$ is large. The step
we take in the Newton method, $d_k = -f(x_k)/f'(x_k)$, can be very large. When $d_k$
is large, we should take a step that is a fraction of this: $x_{k+1} = x_k + s\,d_k$ for some
$0 < s \le 1$.

Where possible, we should use s = 1, because then we are taking a full Newton 
step which gives quadratic convergence. But we need some way of checking the step 
to see if it significantly improves our approximate solution. Of course, if an equation 
does not have a solution (or if the algorithm does not find one) then we want the 
algorithm to fail gracefully. Ideally, the method should be able to identify when 
the method cannot make significant progress. Simply having some improvement 




Table 3.3.4 Example of failure of guarded Newton method with an "any decrease" strategy

k    x_k             f(x_k)
0    +3.162015182    +6.303807760
1    -3.079829414    -6.220782816
2    +3.069566528    +6.210397393
3    -3.065443184    -6.206222764
4    +3.063612597    +6.204369002
5    -3.062768079    -6.203513705
6    +3.062371889    +6.203112440


($|f(x_{k+1})| < |f(x_k)|$ for all $k$) is not sufficient to drive $|f(x_k)|$ to a local minimum. An example of this is $f(x) = 2x + a \sin(x)$ where $a = -1/(\cos(\hat{x}) - \sin(\hat{x})/(2\hat{x})) \approx 0.990279$, $\hat{x} = \pi - 1/(4\pi)$, and $x_0 = \hat{x} + 0.1$. The unique solution to the equation $f(x^*) = 0$ is $x^* = 0$. The iterates and their function values are shown in Table 3.3.4, which demonstrates $|f(x_k)|$ decreasing in $k$ with limit $|f(\hat{x})| \neq 0$. This kind of failure is rare, but can occur.

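Instead we will require a "sufficient decrease" in the function values that is achievable. Since

$$
\frac{d}{ds} f(x_k + s\,d_k)\Big|_{s=0} = f'(x_k)\, d_k \quad\text{and}\quad d_k = -\frac{f(x_k)}{f'(x_k)},
$$

we have

$$
\frac{d}{ds} f(x_k + s\,d_k)\Big|_{s=0} = f'(x_k)\left(-\frac{f(x_k)}{f'(x_k)}\right) = -f(x_k).
$$

Therefore, using Taylor series with second-order remainder, we have

$$
\begin{aligned}
f(x_k + s\,d_k) &= f(x_k) + \frac{d}{ds} f(x_k + s\,d_k)\Big|_{s=0}\, s + \frac{1}{2}\frac{d^2}{ds^2} f(x_k + s\,d_k)\Big|_{s=c_s} s^2 \\
&= f(x_k) - s\,f(x_k) + \frac{1}{2} f''(x_k + c_s d_k)\, d_k^2\, s^2 \\
&= (1 - s)\, f(x_k) + \frac{1}{2} f''(x_k + c_s d_k)\,(s\,d_k)^2
\end{aligned}
$$

for some $c_s$ between $0$ and $s$. We can operationalize the idea of "significant improvement that is achievable" by requiring that

$$
|f(x_k + s\,d_k)| \le \left(1 - \tfrac{1}{2} s\right) |f(x_k)|. \tag{3.3.2}
$$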


This is achievable because for sufficiently small s > 0 we have 
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Algorithm 42 Guarded Newton method 
 1  function gnewton(f, f', x_0, ε, η)
 2    k ← 0
 3    while |f(x_k)| > ε
 4      d_k ← −f(x_k)/f'(x_k)
 5      s ← 1
 6      while |f(x_k + s d_k)| > (1 − s/2) |f(x_k)|
 7        s ← s/2
 8        if s < η then return x_k
 9      end while
10      x_{k+1} ← x_k + s d_k
11      k ← k + 1
12    end while
13    return x_k
14  end function


Table 3.3.5 Guarded Newton method near a local minimizer of |f(x)|

k    x_k                  d_k               s_k      f(x_k)
0    0.500000000000000    -1.3621 × 10^0    2^-2     1.23437500
1    0.159482758620690    -3.3123 × 10^0    2^-5     1.02492769
2    0.055972268830070    -9.0558 × 10^0    2^-8     1.00311098
3    0.020597954605748    -2.4379 × 10^1    2^-11    1.00042318
4    0.008694301949811    -5.7607 × 10^1    2^-13    1.00007551
5    0.001662175199375    -3.0091 × 10^2    2^-18    1.00000276
6    0.000514312730217    -9.7227 × 10^2    2^-21    1.00000026


$$
|f(x_k + s\,d_k)| \le (1 - s)\,|f(x_k)| + M\,(s\,d_k)^2,
$$

where $M$ is a bound on $|f''|$. As long as

$$
(1 - s)\,|f(x_k)| + M\,(s\,d_k)^2 \le \left(1 - \tfrac{1}{2} s\right) |f(x_k)|,
$$

the sufficient decrease criterion (3.3.2) is satisfied. Algorithm 42 shows how to implement a guarded Newton method.
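
As an illustration, here is a minimal Python sketch of Algorithm 42. The iteration cap `max_iter` is an added safeguard that is not part of the algorithm in the text, and the example problem ($\tan^{-1} x = 0$ with $x_0 = 2$) is the failure case for the unguarded method mentioned earlier.

```python
import math


def gnewton(f, df, x0, eps, eta, max_iter=200):
    """Guarded Newton method, following Algorithm 42.

    Returns the final iterate. The caller should check |f(x)| <= eps,
    because the routine also returns when the step length s drops
    below eta. max_iter is an extra safeguard (an assumption of this
    sketch, not part of Algorithm 42).
    """
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) <= eps:
            break
        d = -fx / df(x)              # full Newton step
        s = 1.0
        # halve s until the sufficient decrease test (3.3.2) holds
        while abs(f(x + s * d)) > (1 - s / 2) * abs(fx):
            s /= 2
            if s < eta:
                return x             # line search failed: give up gracefully
        x = x + s * d
    return x


root = gnewton(math.atan, lambda x: 1 / (1 + x * x), 2.0, 1e-12, 1e-10)
print(root)
```

On this problem the first step is halved once (the full step would increase $|f|$), after which full Newton steps are accepted.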

This guarded Newton method can still fail: take, for example, $f(x) = 1 + x^2 - (x/2)^3$ and $x_0 = 1/2$. Since we require that $|f(x_{k+1})| < |f(x_k)|$, the magnitude of the function values must decrease, which means that the iterates tend to become trapped near a local minimizer of $|f(x)|$ that might be far from the solution. In the case of this function, $x = 0$ is a local minimizer of $|f(x)|$ while the true solution is $x^* \approx 8.12129$. Table 3.3.5 illustrates how the guarded Newton method behaves when it gets trapped near a local minimum of $|f(x)|$.

The behavior of our guarded Newton method can be summarized as follows. 
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Theorem 3.9 /f x,, k =0,1,2,... are the iterates of the guarded Newton method 
Algorithm 42 with $f$ twice continuously differentiable and the iterates are bounded, then either $f(x_k) \to 0$ as $k \to \infty$, or $f'(x_k) \to 0$ as $k \to \infty$.


Proof Suppose that $f(x_k) \not\to 0$ as $k \to \infty$. Let $B$ be a bound on $|x_k|$ and $M$ a bound on $|f''(x)|$ for $|x| \le B$. Also, let $s_k > 0$ be the value of $s$ used in line 10 of Algorithm 42, so that $x_{k+1} = x_k + s_k d_k$.

Since $|f(x_k)| > |f(x_{k+1})| > 0$ for all $k$, the sequence $|f(x_k)|$ converges to a limit $f^*$ as $k \to \infty$. Since $f(x_k) \not\to 0$ as $k \to \infty$, it follows that $f^* > 0$. Thus $|f(x_{k+1})| / |f(x_k)| \to f^*/f^* = 1$ as $k \to \infty$. As $|f(x_{k+1})| \le (1 - \tfrac{1}{2} s_k)\,|f(x_k)|$, it follows that $s_k \to 0$ as $k \to \infty$. Since $f(x_k + s\,d_k) = f(x_k) + f'(x_k)\, s\, d_k + \tfrac{1}{2} f''(x_k + c\,s\,d_k)(s\,d_k)^2$ for some $0 < c < 1$, and $f'(x_k)\, d_k = -f(x_k)$, using the bound $M$ on $|f''(x)|$ for $|x| \le B$, we see that $|f(x_k + s\,d_k)| \le (1 - s)\,|f(x_k)| + M s^2 d_k^2$. We choose $s_k$ to be the first value of $s$ among $1, \tfrac12, \tfrac14, \ldots$ where $|f(x_k + s\,d_k)| \le (1 - \tfrac{1}{2} s)\,|f(x_k)|$, which will happen if $M s^2 d_k^2 \le \tfrac{1}{2} s\,|f(x_k)|$, that is, if $s \le |f(x_k)| / (2 M d_k^2)$. Because $s$ is halved at each trial, whenever $s_k < 1$ we can guarantee that $s_k \ge \tfrac{1}{2}\,|f(x_k)| / (2 M d_k^2)$, and so $s_k\, d_k^2 \ge f^* / (4M)$. Since $s_k \to 0$ it then follows that $|d_k| = |f(x_k)| / |f'(x_k)| \to \infty$ as $k \to \infty$. As $|f(x_k)| \le |f(x_0)|$, this in turn implies that $f'(x_k) \to 0$, as we wanted.


Note that the use of the sufficient decrease criterion (3.3.2) is needed to ensure that $s_k \to 0$ if $f(x_k) \not\to 0$. This means that there are circumstances in which we can guarantee that the guarded Newton method converges: if $\{\, x \mid |f(x)| \le |f(x_0)| \,\}$ is a bounded set, and $f'(x) \neq 0$ for any $x$ in this set.


3.3.4 Variant: Multivariate Newton method 


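If we wish to solve a system of nonlinear equations $f(x) = 0$ where $f\colon \mathbb{R}^n \to \mathbb{R}^n$, we can use a well-known variant of Newton's method. Note that for $d \approx 0$, $f(x + d) \approx f(x) + \nabla f(x)\, d$ where $\nabla f(x)$ is the Jacobian matrix:

$$
\nabla f(x) =
\begin{bmatrix}
\partial f_1/\partial x_1(x) & \partial f_1/\partial x_2(x) & \cdots & \partial f_1/\partial x_n(x) \\
\partial f_2/\partial x_1(x) & \partial f_2/\partial x_2(x) & \cdots & \partial f_2/\partial x_n(x) \\
\vdots & \vdots & \ddots & \vdots \\
\partial f_n/\partial x_1(x) & \partial f_n/\partial x_2(x) & \cdots & \partial f_n/\partial x_n(x)
\end{bmatrix}.
\tag{3.3.3}
$$

Solving $f(x) + \nabla f(x)\, d = 0$ for $d$ instead of directly solving $f(x) = 0$ means solving linear equations. In fact, $d = -\nabla f(x)^{-1} f(x)$, and so the improved approximation for the solution is $x + d = x - \nabla f(x)^{-1} f(x)$. Repeating this improvement gives the multivariate Newton method, as shown in Algorithm 43.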

The convergence theory follows that for the one-variable case, except that we must 
be more careful in our application of Taylor series with remainder, which should be 
kept in integral form. 
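
To make the iteration concrete, the following is a small sketch of the multivariate Newton method for two equations in two unknowns, written in plain Python so that the linear solve is visible (Cramer's rule for the 2 × 2 system). The function name `mvnewton2`, the cap `max_iter`, and the test problem ($x^2 + y^2 = 1$, $x = y$) are illustrative assumptions, not taken from the text.

```python
import math


def mvnewton2(f, jac, x0, eps, max_iter=50):
    """Multivariate Newton method for a 2x2 system.

    f(x, y) returns (f1, f2); jac(x, y) returns the Jacobian
    ((a, b), (c, d)). max_iter is an added safeguard.
    """
    x, y = x0
    for _ in range(max_iter):
        f1, f2 = f(x, y)
        if math.hypot(f1, f2) <= eps:      # 2-norm of f
            break
        (a, b), (c, d) = jac(x, y)
        det = a * d - b * c
        # solve J [dx, dy]^T = -[f1, f2]^T by Cramer's rule
        dx = (-f1 * d + b * f2) / det
        dy = (-a * f2 + c * f1) / det
        x, y = x + dx, y + dy
    return x, y


f = lambda x, y: (x * x + y * y - 1.0, x - y)
jac = lambda x, y: ((2 * x, 2 * y), (1.0, -1.0))
print(mvnewton2(f, jac, (1.0, 0.5), 1e-12))
```

For larger systems one would solve the linear system with a factorization routine rather than forming an inverse or using Cramer's rule.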
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Algorithm 43 Multivariate Newton method 
1  function mvnewton(f, ∇f, x_0, ε)
2    k ← 0
3    while ‖f(x_k)‖_2 > ε
4      x_{k+1} ← x_k − ∇f(x_k)^{-1} f(x_k)
5      k ← k + 1
6    end while
7    return x_k
8  end function


Theorem 3.10 (Convergence of multivariate Newton's method) Suppose that $f\colon \mathbb{R}^n \to \mathbb{R}^n$ has continuous second derivatives, $f(x^*) = 0$, and $\nabla f(x^*)$ is invertible. Then there is a $\delta > 0$ where $\|x_0 - x^*\| < \delta$ implies that $x_k \to x^*$ and there is a constant $C$ such that $\|x_{k+1} - x^*\| \le C\, \|x_k - x^*\|^2$ for $k = 0, 1, 2, \ldots$.


Proof We start with multivariate Taylor series with second-order remainder in integral form (1.6.5):

$$
f(x + d) = f(x) + \nabla f(x)\, d + \int_0^1 (1 - s)\, D^2 f(x + s\,d)[d, d]\, ds.
$$

We first want to find $\delta > 0$ where $\|x_0 - x^*\| < \delta$ implies $\|x_k - x^*\| < \delta$ for $k = 1, 2, \ldots$. Let $M = \max_{y:\, \|y - x^*\| \le 1} \|D^2 f(y)\|$; we will ensure that $\delta \le 1$ in what follows. Note that if $A$ is invertible, then $B$ is invertible provided $\|A^{-1}\|\, \|B - A\| < 1$ by (2.1.1). If $A = \nabla f(x^*)$ and $B = \nabla f(y)$ and $\|y - x^*\| \le 1$ then $\|B - A\| \le M\, \|y - x^*\|$. To ensure invertibility of $B = \nabla f(y)$, we require $\|A^{-1}\|\, \|B - A\| \le 1/2$; this is guaranteed provided we additionally require that $\|y - x^*\| \le 1/(2\, \|A^{-1}\|\, M)$. Set $\delta_1 = \min(1,\ 1/(2\, \|A^{-1}\|\, M))$. This requirement also ensures that

$$
\|B^{-1}\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}(B - A)\|} \le 2\, \|A^{-1}\|.
$$


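Let $e_k = x_k - x^*$. Then

$$
\begin{aligned}
0 = f(x^*) &= f(x_k - e_k) \\
&= f(x_k) + \nabla f(x_k)(-e_k) + \eta_k,
\end{aligned}
$$

with $\|\eta_k\| \le M\, \|e_k\|^2$, so $f(x_k) = \nabla f(x_k)\, e_k - \eta_k$.

Therefore

$$
\begin{aligned}
e_{k+1} = x_{k+1} - x^* &= \left(x_k - \nabla f(x_k)^{-1} f(x_k)\right) - x^* \\
&= x_k - x^* - \nabla f(x_k)^{-1} f(x_k) \\
&= e_k - \nabla f(x_k)^{-1} \left[\nabla f(x_k)\, e_k - \eta_k\right] \\
&= e_k - e_k + \nabla f(x_k)^{-1} \eta_k = \nabla f(x_k)^{-1} \eta_k.
\end{aligned}
$$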


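Then provided $\|e_k\| \le \delta_1$,

$$
\|e_{k+1}\| \le \|\nabla f(x_k)^{-1}\|\, \|\eta_k\| \le 2\, \|\nabla f(x^*)^{-1}\|\, M\, \|e_k\|^2.
$$

Provided $2\, \|\nabla f(x^*)^{-1}\|\, M\, \|e_k\| \le 1/2$ we have $\|e_{k+1}\| \le \tfrac{1}{2} \|e_k\|$. Let $\delta_2 = \min(\delta_1,\ 1/(4\, \|\nabla f(x^*)^{-1}\|\, M))$. Then $\|e_k\| \le \delta_2$ implies $\|e_{k+1}\| \le \tfrac{1}{2} \|e_k\| \le \delta_2$. By induction, it is clear that if $\|e_0\| \le \delta_2$ then $\|e_k\| \le \delta_2$ for $k = 1, 2, \ldots$. Furthermore, $\|e_{k+1}\| \le \tfrac{1}{2} \|e_k\|$ for all $k$ and so $e_k \to 0$ as $k \to \infty$. For the rate of convergence, we use

$$
\|e_{k+1}\| \le 2\, \|\nabla f(x^*)^{-1}\|\, M\, \|e_k\|^2.
$$

Choosing $C = 2\, \|\nabla f(x^*)^{-1}\|\, M$ gives $\|e_{k+1}\| \le C\, \|e_k\|^2$ as we wanted.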


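If $x_0$ is not close to $x^*$, we have no guarantee that Newton's method converges. We can instead make Newton's method more reliable by using a multivariate guarded Newton method (see Algorithm 42). Note that

$$
\begin{aligned}
\frac{d}{ds}\left(\|f(x + s\,d)\|_2^2\right) &= \frac{d}{ds}\left(f(x + s\,d)^T f(x + s\,d)\right) \\
&= 2\, f(x + s\,d)^T \frac{d}{ds} f(x + s\,d) \\
&= 2\, f(x + s\,d)^T\, \nabla f(x + s\,d)\, d.
\end{aligned}
$$

If we choose $d$ to be $-\nabla f(x)^{-1} f(x)$ then

$$
\begin{aligned}
\frac{d}{ds}\left(\|f(x + s\,d)\|_2^2\right)\Big|_{s=0} &= 2\, f(x + s\,d)^T\, \nabla f(x + s\,d)\, d\,\Big|_{s=0} \\
&= -2\, f(x)^T\, \nabla f(x)\, \nabla f(x)^{-1} f(x) \\
&= -2\, \|f(x)\|_2^2,
\end{aligned}
$$

so

$$
\frac{d}{ds}\left(\|f(x + s\,d)\|_2\right)\Big|_{s=0} = -\|f(x)\|_2.
$$

Thus our sufficient decrease criterion can be written as

$$
\|f(x + s\,d)\|_2 \le \left(1 - \tfrac{1}{2} s\right) \|f(x)\|_2. \tag{3.3.4}
$$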


Algorithm 44 Guarded Newton method

 1  function mvgnewton(f, ∇f, x_0, ε, η)
 2    k ← 0
 3    while ‖f(x_k)‖_2 > ε
 4      d_k ← −∇f(x_k)^{-1} f(x_k)
 5      s ← 1
 6      while ‖f(x_k + s d_k)‖_2 > (1 − s/2) ‖f(x_k)‖_2
 7        s ← s/2
 8        if s < η: return x_k
 9      end while
10      x_{k+1} ← x_k + s d_k
11      k ← k + 1
12    end while
13    return x_k
14  end function


A complete multivariate guarded Newton method is given in Algorithm 44. 

Theorem 3.9 can be modified to show that using (3.3.4) for the sufficient decrease criterion (line 6 of Algorithm 42), provided that the iterates $x_k$ are bounded and that $f$ has continuous second derivatives, results either in $x_k \to x^*$ as $k \to \infty$ where $f(x^*) = 0$, or $s_k \to 0$, $\|d_k\| \to \infty$, and $\|\nabla f(x_k)^{-1}\| \to \infty$ as $k \to \infty$. If $\hat{x}$ is any limit point of the $x_k$'s that is not a solution, then $\nabla f(\hat{x})$ is not invertible. The proof of this follows Theorem 3.9, except that we first use $\|f(x_k)\| \downarrow f^*$ as $k \to \infty$.


Exercises.

(1) Apply Newton's method to solving $e^x + x^2 - 4 = 0$ with starting point $x_0 = 1$. How many iterations are needed for an accuracy of about $10^{-12}$?

(2) A method for computing square roots is to apply Newton's method to solving $x^2 - a = 0$ for $x$. Show that Newton's method applied to this equation is

$$
x_{n+1} = \frac{1}{2}\left(x_n + \frac{a}{x_n}\right).
$$

Show that the iteration function $g(x) = \tfrac{1}{2}(x + a/x)$ has a minimum at $x = \sqrt{a}$ and $g(\sqrt{a}) = \sqrt{a}$, so that $x_{n+1} \ge \sqrt{a}$. Show that Newton's method is globally convergent for any positive starting point. [Hint: Show that $0 \le g'(x) \le \tfrac{1}{2}$ for all $x \ge \sqrt{a}$, so that $g$ is a contraction mapping on $[\sqrt{a}, \infty)$.]

(3) While Newton's method for square roots is globally convergent (Exercise 2), if we are to use this to create an efficient method for all inputs, we want to ensure that the starting point is close to the solution. Since for floating point numbers, we have $a > 0$ represented by $a = +b \times 2^e$ where $1 \le b < 2$, show how we can shift $a$ to $2^{2m} a$ in the interval $[\tfrac{1}{2}, 2)$ without incurring any roundoff
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error. How many iterations of Newton’s method will guarantee a relative error 
of no more than 10~!° for x9 = 1 and 5 <a <2? (Hint: If u, = x,/./a, then 
Und = 5 (Un + 1/u,,). First consider 4 <a < 1; start with ug = x9/./a which 
is in the interval [1, /2]. The worst case is up = /2 in the iteration Unt = 
5 (Un + 1/un).] 

(4) Apply Newton's method to the equation $x^m = 0$ with $m > 1$. How quickly does Newton's method converge for this problem?

(5) The guarded Newton method does not guarantee convergence to a solution, even if one exists. Apply the guarded Newton method to $f(x) = \tfrac{1}{3} x^3 - x + 10$ with $x_0 = 0.7$. What (if anything) do the iterates $x_k$ converge to?

(6) Show that for the guarded Newton method applied to solving $f(x) = 0$ for $f\colon \mathbb{R} \to \mathbb{R}$, either the iterates $x_k$, $k = 1, 2, \ldots$ are unbounded, or $f(x_k) \to 0$ as $k \to \infty$, or $f'(x_k) \to 0$ as $k \to \infty$.

(7) Show that it is possible that the iterates $x_k$, $k = 1, 2, \ldots$ are unbounded for the guarded Newton method: use $f(x) = e^{-x}$ and $x_0 = 0$.

(8) Apply the guarded multivariate Newton method (Algorithm 44) to the function 


: 
_ yx — y _|* 
(a\2 27 «| where x = || 


with xo =[1, 1]”, xo = [1, —$]", and with xp = [5, —$]”. Report the solu- 
tion you obtain with a tolerance 10~!°. How many iterations does this method 
need? 

(9) The guarded multivariate Newton method (Algorithm 44) uses the 2-norm. Would the method work well with a different norm here? Try the previous exercise with the 1-norm and the $\infty$-norm. Specifically answer the questions for using a norm different from the 2-norm:

(a) For $f$ smooth and $\nabla f(x_k)$ invertible, is the line search in lines 5-9 guaranteed to terminate in exact arithmetic (even with $\eta = 0$)?

(b) Does the method accept $s = 1$ in line 6 for $x_k \approx x^*$ provided $\nabla f(x^*)$ is invertible and $f$ smooth?


(10) In this extensive exercise, we will establish bounds on the number of Newton steps for Algorithm 44 under some common assumptions:

$$
\left\| \left(\nabla f(x) - \nabla f(y)\right) \nabla f(y)^{-1} \right\| \le L\, \|x - y\|, \qquad
\left\| \nabla f(x)^{-1} \right\| \le M
$$

for all $x$ and $y$.

(a) Show that for $d = -\nabla f(x)^{-1} f(x)$,

$$
\begin{aligned}
f(x + s\,d) &= f(x) + s\, \nabla f(x)\, d + \int_0^s \left[\nabla f(x + t\,d) - \nabla f(x)\right] d\; dt \\
&= (1 - s)\, f(x) + \eta(s), \quad\text{where} \\
\|\eta(s)\| &\le \tfrac{1}{2} L s^2\, \|d\|\, \|f(x)\| \le \tfrac{1}{2} L M s^2\, \|f(x)\|^2.
\end{aligned}
$$

(b) Show that the sufficient decrease criterion $\|f(x + s\,d)\| \le (1 - \tfrac{1}{2} s)\, \|f(x)\|$ is satisfied if $s\, \|f(x)\| \le 1/(LM)$.

(c) From (b) show that if $x = x_k$ then the chosen step length $s_k$ in Algorithm 44 satisfies either $s_k = 1$ or $1/(LM) \ge s_k\, \|f(x_k)\| \ge \tfrac{1}{2}/(LM)$.

(d) Show that

$$
\|f(x_{k+1})\| \le \left(1 - \tfrac{1}{2} s_k\right) \|f(x_k)\| \le \max\left(\tfrac{1}{2} \|f(x_k)\|,\ \|f(x_k)\| - \frac{1}{4LM}\right) \quad\text{and}
$$

$$
LM\, \|f(x_{k+1})\| \le \tfrac{1}{2} \left(LM\, \|f(x_k)\|\right)^2 \quad\text{if } LM\, \|f(x_k)\| \le \tfrac{1}{2}.
$$

(e) Combine these results to show that for the first phase (if $LM\, \|f(x_k)\| > \tfrac{1}{2}$) we have at most $4LM\, \|f(x_0)\|$ iterations, and a quadratic convergence phase (where $LM\, \|f(x_k)\| \le \tfrac{1}{2}$) which has no more than $1 + \log_2^+(\log_2(LM/\epsilon))$ iterations to achieve $\|f(x_k)\| < \epsilon$. Note that $\log_2^+(u) = \log_2(u)$ if $u > 1$ and zero otherwise. Give a bound for the total number of iterations needed.

(f) Give a bound on the total number of function evaluations (not just Newton steps). This takes into account the cost of the line search.


3.4 Secant and hybrid methods 


There are a number of ways of improving on Newton’s method. One is to avoid the 
need for computing derivatives. Another is to improve reliability by incorporating 
aspects of the bisection method. The first approach leads to the secant method, while 
the second approach leads to Regula Falsi and other hybrid methods such as Dekker’s 
method and Brent’s method. 

All of the methods of this section are restricted to functions of one variable. 


3.4.1 Convenience: Secant method 


Computing derivatives can be a difficult task, especially if the function is defined through a complex piece of code. Instead, if we have already computed $f(x_k)$ and $f(x_{k-1})$, we can approximate $f'(x_k) \approx (f(x_k) - f(x_{k-1}))/(x_k - x_{k-1})$. Substituting this into (3.3.1) gives the secant method:
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Algorithm 45 Secant method 

1  function secant(f, x_0, x_1, ε)
2    k ← 1
3    while |f(x_k)| > ε
4      x_{k+1} ← x_k − f(x_k)(x_k − x_{k−1}) / (f(x_k) − f(x_{k−1}))
5      k ← k + 1
6    end while
7    return x_k
8  end function


Table 3.4.1 Comparison of Newton's method with secant method for f(x) = e^x − x^2 − 3

     Newton method                           Secant method
k    x_k                  f(x_k)             x_k                  f(x_k)
0    1.000000000000000    -1.2817 × 10^0     1.000000000000000    -1.2817 × 10^0
1    2.784422382354665    +5.4375 × 10^0     2.000000000000000    +3.8906 × 10^-1
2    2.272498962410127    +1.5394 × 10^0     1.767140238028185    -2.6870 × 10^-1
3    1.974092118993297    +3.0304 × 10^-1    1.862265070280968    -2.9728 × 10^-2
4    1.880903342392723    +2.1630 × 10^-2    1.874098597392787    +2.6983 × 10^-3
5    1.873171697204092    +1.3577 × 10^-4    1.873113870758385    -2.3969 × 10^-5
6    1.873122549684362    +5.4455 × 10^-9    1.873122540803784    -1.9086 × 10^-8
7    1.873122547713043    -4.4409 × 10^-16   1.873122547713092    +1.3456 × 10^-13


$$
x_{k+1} = x_k - \frac{f(x_k)}{(f(x_k) - f(x_{k-1}))/(x_k - x_{k-1})}
        = x_k - \frac{f(x_k)\,(x_k - x_{k-1})}{f(x_k) - f(x_{k-1})}. \tag{3.4.1}
$$

Note that $x_{k+1}$ is the point where the chord line passing through $(x_k, f(x_k))$ and $(x_{k-1}, f(x_{k-1}))$ crosses the $x$-axis. Thus, the right-hand side of (3.4.1) is actually symmetric in $x_k$ and $x_{k-1}$:

$$
x_{k+1} = \frac{x_{k-1}\, f(x_k) - x_k\, f(x_{k-1})}{f(x_k) - f(x_{k-1})}.
$$


The secant method is shown in Algorithm 45. 

An example of using the secant method compared to the Newton method is shown 
in Table 3.4.1. 
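
A minimal Python sketch of Algorithm 45 follows, applied to the same test function $f(x) = e^x - x^2 - 3$ with $x_0 = 1$, $x_1 = 2$. The cap `max_iter` is an added safeguard, not part of the algorithm in the text. The sketch keeps the previous function value so that each iteration costs one new function evaluation.

```python
import math


def secant(f, x0, x1, eps, max_iter=100):
    """Secant method following Algorithm 45 (max_iter is an added safeguard)."""
    x_prev, x = x0, x1
    f_prev, fx = f(x_prev), f(x)
    for _ in range(max_iter):
        if abs(fx) <= eps:
            break
        # zero of the chord through (x_prev, f_prev) and (x, fx), as in (3.4.1)
        x_new = x - fx * (x - x_prev) / (fx - f_prev)
        x_prev, f_prev = x, fx
        x = x_new
        fx = f(x)                 # the only new function evaluation
    return x


root = secant(lambda x: math.exp(x) - x * x - 3, 1.0, 2.0, 1e-12)
print(root)
```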

As can be seen in Table 3.4.1, secant has a "head start" (probably because of the choice $x_1 = 2$, which is much closer to $x^*$ than $x_0$), but Newton's method overtakes the secant method in terms of accuracy. Nevertheless, the secant method has impressively fast convergence. The reason for this can be seen in the following theorem.


Theorem 3.11 If $f$ has continuous second derivatives and $f(x^*) = 0$ but $f'(x^*) \neq 0$, then there is a $\delta > 0$ where $|x_0 - x^*| < \delta$ and $|x_1 - x^*| < \delta$ implies $x_k \to x^*$ as $k \to \infty$. Furthermore, if $x_k \to x^*$ as $k \to \infty$, $(x_{k+1} - x^*)/((x_k - x^*)(x_{k-1} - x^*)) \to f''(x^*)/(2 f'(x^*))$ as $k \to \infty$.


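Proof Note that the chord line for computing $x_{k+1}$ is the linear interpolant $p_{1,k}(x)$ of data points $(x_k, f(x_k))$ and $(x_{k-1}, f(x_{k-1}))$, and $x_{k+1}$ is where this chord line crosses the $x$-axis: $p_{1,k}(x_{k+1}) = 0$. From the error formula for linear interpolation (4.1.7),

$$
f(x^*) - p_{1,k}(x^*) = \frac{1}{2!} f''(c_k)\,(x^* - x_k)(x^* - x_{k-1}),
$$

for some $c_k$ between $\min(x_{k-1}, x_k, x^*)$ and $\max(x_{k-1}, x_k, x^*)$. Now $f(x^*) = 0$, $p_{1,k}$ is linear with slope $(f(x_k) - f(x_{k-1}))/(x_k - x_{k-1})$, and $p_{1,k}(x_{k+1}) = 0$, so $p_{1,k}(x^*)$ equals this slope times $(x^* - x_{k+1})$. Therefore,

$$
-\frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}\,(x^* - x_{k+1}) = \frac{1}{2} f''(c_k)\,(x^* - x_k)(x^* - x_{k-1}).
$$

The fraction $(f(x_k) - f(x_{k-1}))/(x_k - x_{k-1})$ equals $f'(d_k)$ for some $d_k$ between $x_{k-1}$ and $x_k$. So

$$
-f'(d_k)\,(x^* - x_{k+1}) = \frac{1}{2} f''(c_k)\,(x^* - x_k)(x^* - x_{k-1}).
$$

Writing $e_j = x_j - x^*$ we get $e_{k+1} = e_k\, e_{k-1}\, f''(c_k)/(2 f'(d_k))$. Choose $\delta_1 > 0$ so that $|f'(z)| \ge \tfrac{1}{2} |f'(x^*)|$ for all $z$ where $|z - x^*| \le \delta_1$. Let $M = \max_{w:\, |w - x^*| \le \delta_1} |f''(w)| \,/\, (\tfrac{1}{2} |f'(x^*)|)$. Then $|e_{k+1}| \le M\, |e_k|\, |e_{k-1}|$ provided $x_k, x_{k-1} \in [x^* - \delta_1, x^* + \delta_1]$. Let $\delta_2 = \min(\delta_1, 1/(2M))$. Then provided $|e_k|, |e_{k-1}| \le \delta_2$ we have $|e_{k+1}| \le M\, |e_k|\, |e_{k-1}| \le \tfrac{1}{2} |e_k| \le \delta_2$.

Thus, if $|e_0| = |x_0 - x^*| \le \delta_2$ and $|e_1| = |x_1 - x^*| \le \delta_2$, we have $|e_{k+1}| \le M\, |e_k|\, |e_{k-1}| \le \tfrac{1}{2} |e_k| \le \delta_2$ for all $k = 1, 2, 3, \ldots$. This gives us convergence: $e_k \to 0$ as $k \to \infty$ and thus $x_k \to x^*$ as $k \to \infty$. Furthermore, $c_k \to x^*$ and $d_k \to x^*$ as $k \to \infty$; finally,

$$
\lim_{k \to \infty} \frac{x_{k+1} - x^*}{(x_k - x^*)(x_{k-1} - x^*)} = \lim_{k \to \infty} \frac{f''(c_k)}{2 f'(d_k)} = \frac{f''(x^*)}{2 f'(x^*)}
$$

by the Squeeze theorem, as we wanted.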


Letting $e_j = x_j - x^*$ we see that $e_{k+1}/(e_k e_{k-1}) \to f''(x^*)/(2 f'(x^*))$. Write $C = |f''(x^*)/(2 f'(x^*))|$ and suppose $C > 0$. Thus if there is convergence, for sufficiently large $k$, we have $|e_{k+1}| \le 2C\, |e_k|\, |e_{k-1}|$. Multiplying by $2C$ gives

$$
2C\, |e_{k+1}| \le 2C\, |e_k| \cdot 2C\, |e_{k-1}|.
$$

Taking logarithms,

$$
\ln(2C\, |e_{k+1}|) \le \ln(2C\, |e_k|) + \ln(2C\, |e_{k-1}|).
$$

Setting $\eta_j = \ln(2C\, |e_j|)$ we get $\eta_{k+1} \le \eta_k + \eta_{k-1}$. If we have $\eta_0, \eta_1 < 0$ then we see that $\eta_k \to -\infty$ as $k \to \infty$, and furthermore the size of $\eta_k$ grows exponentially fast.

How fast is this growth? We can determine that by looking at the linear recurrence

$$
\hat{\eta}_{k+1} = \hat{\eta}_k + \hat{\eta}_{k-1}, \qquad \hat{\eta}_0 = \eta_0 \text{ and } \hat{\eta}_1 = \eta_1. \tag{3.4.2}
$$

It is easy to check that $\eta_j \le \hat{\eta}_j$ for $j = 0, 1, 2, \ldots$ by induction, so $\hat{\eta}_j$ is an upper bound on $\eta_j = \ln(2C\, |e_j|)$. We can solve (3.4.2) exactly: $\hat{\eta}_j = C_1 r_1^j + C_2 r_2^j$ where $r_1$ and $r_2$ are solutions of the characteristic equation $r^2 = r + 1$. The solutions are the Golden ratio $\phi := (1 + \sqrt{5})/2 \approx 1.6180$ and $-1/\phi \approx -0.6180$. This means that $\ln(2C\, |e_k|) \approx -\text{const}\, \phi^k$ for $k$ large, provided $e_0$ and $e_1$ are sufficiently small.

Compare this with Newton's method, where we have $|e_{k+1}| \le C'\, |e_k|^2$, so $\ln(C'\, |e_{k+1}|) \le 2 \ln(C'\, |e_k|)$. This means that $\ln(C'\, |e_k|) \approx -\text{const}\, 2^k$ provided $e_0$ is sufficiently small. The number of iterations needed for Newton's method is therefore generally less than for the secant method, but only by a factor of about $\log \phi / \log 2 \approx 0.69$. Considering that one iteration of Newton's method requires both a function and a derivative evaluation, while the secant method only requires one function evaluation, the additional iterations are often a price worth paying.


3.4.2 Regula Falsi 


Regula Falsi, at least as the name is usually understood today, is an attempt to combine 
the speed of the secant method with the reliability of the bisection method. While it 
does achieve the reliability of the bisection method, it fails to obtain the speed of the 
secant method, and can even be slower than bisection. 

The idea is to take the bisection method, but to compute the new point $c$ using the result of the secant method: $c \leftarrow a - f(a)(b - a)/(f(b) - f(a))$ instead of the midpoint $m = (a + b)/2$. We then update either $a \leftarrow c$ or $b \leftarrow c$ according to the sign of $f(c)$. This is shown in Algorithm 46. Note that as $f(a)$ and $f(b)$ have opposite signs, the zero of the linear interpolant (which is $c$) must lie between $a$ and $b$.

Applied to the function $f(x) = e^x - x^2 - 3$ with $a = 1$ and $b = 2$, we get the results shown in Table 3.4.2. Note that the values $a_k$ and $b_k$ are the values at the end of the $k$th execution of lines 6-12 of Algorithm 46.

Several things are apparent from Table 3.4.2: $b_k = b_0$ for all $k$, and $f(a_{k+1})/f(a_k) \approx 1/10$ for $k \ge 2$. It appears from this that the ratio of successive errors $e_{k+1}/e_k$ does not go to zero, indicating a linear rate of convergence similar to general fixed-point iterations. This method does not appear to have the rapid convergence of Newton's method or the secant method.
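
The behavior described above can be reproduced with a short Python sketch of Regula Falsi. Two details are assumptions of this sketch rather than part of the text: the cap `max_iter`, and raising `ValueError` (instead of returning a failure value) when the endpoints do not bracket a zero.

```python
import math


def regula_falsi(f, a, b, eps, max_iter=200):
    """Regula Falsi: bisection-style bracketing with the secant point."""
    fa, fb = f(a), f(b)
    if fa * fb >= 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(max_iter):
        if min(abs(fa), abs(fb)) <= eps:
            break
        c = a - fa * (b - a) / (fb - fa)   # zero of the chord through a and b
        fc = f(c)
        if fc == 0:
            return c
        if (fc > 0) == (fa > 0):
            a, fa = c, fc                  # c replaces the endpoint with the same sign
        else:
            b, fb = c, fc
    return a if abs(fa) < abs(fb) else b


root = regula_falsi(lambda x: math.exp(x) - x * x - 3, 1.0, 2.0, 1e-10)
print(root)
```

With $[a, b] = [1, 2]$ the first computed point is $c \approx 1.767140$, and only the left endpoint is ever updated, as in Table 3.4.2.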

To analyze this method to see why this happens, suppose that $f''(x) > 0$ for all $x \in [a, b]$. If $p_1(x)$ is the linear interpolant of $(a, f(a))$ and $(b, f(b))$, then for some $d_x$ between $a$ and $b$, by (4.1.7),
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Algorithm 46 Regula Falsi method 
 1  function regulafalsi(f, a, b, ε)
 2    if sign f(a) ≠ −sign f(b)
 3      return fail
 4    end if
 5    while min(|f(a)|, |f(b)|) > ε
 6      c ← a − f(a)(b − a) / (f(b) − f(a))
 7      if f(c) = 0: return c
 8      if sign f(c) = sign f(a)
 9        a ← c
10      else
11        b ← c
12      end if
13    end while
14    if |f(a)| < |f(b)| return a else return b
15  end function

Table 3.4.2 Results for Regula Falsi applied to f(x) = e^x − x^2 − 3 with initial [a, b] = [1, 2]

k    a_k                 f(a_k)              b_k    f(b_k)
0    1                   -1.28172 × 10^0     2      0.389056
1    1.76714023802819    -2.68697 × 10^-1    2      0.389056
2    1.86226507028097    -2.97277 × 10^-2    2      0.389056
3    1.87204229781722    -2.98139 × 10^-3    2      0.389056
4    1.87301539863046    -2.95957 × 10^-4    2      0.389056
5    1.87311192293272    -2.93490 × 10^-5    2      0.389056
6    1.87312149420393    -2.91015 × 10^-6    2      0.389056


$$
f(x) - p_1(x) = \frac{f''(d_x)}{2!}\,(x - a)(x - b),
$$

so for the new point $c$,

$$
f(c) - 0 = f(c) - p_1(c) = \frac{f''(d_c)}{2!}\,(c - a)(c - b) < 0.
$$

That is, the sign of $f(c)$ will always be negative. This means that only one end of the interval will be updated in the body of the while loop. A similar result holds if $f''(x) < 0$ for all $x \in [a, b]$. Then if, as in our example, $a_k$ increases toward the solution $x^*$, but $b_k = b_0$ for all $k$, then $b_k - a_k \to b_0 - x^* \neq 0$. The slope of the chord line will not approach the slope of the tangent line at $x^*$. This gives us linear convergence rather than the accelerated convergence of the Newton or secant methods.

We now make these arguments more precise. Let $p_{1,k}(x)$ be the linear interpolant of $(a_k, f(a_k))$ and $(b_k, f(b_k))$ so that $c_k = a_k - f(a_k)(b_k - a_k)/(f(b_k) - f(a_k))$ is the solution of $p_{1,k}(c_k) = 0$. Assume that $f''(x) > 0$ for all $x \in [a_0, b_0]$, and $f(a_0) < 0$. By the above arguments, $f''(x) > 0$ for all $x \in [a_k, b_k]$, $f(a_k) < 0 < f(b_k)$, and $b_k = b_0$ for all $k$. Then

$$
a_{k+1} = c_k = a_k - \frac{f(a_k)\,(b_k - a_k)}{f(b_k) - f(a_k)}.
$$

As with the analysis of the secant method, we note that $f(x^*) = 0$, so

$$
-p_{1,k}(x^*) = f(x^*) - p_{1,k}(x^*) = \frac{f''(d_k)}{2}\,(x^* - a_k)(x^* - b_k),
$$

for some $d_k$ between $a_k$ and $b_k$. Noting that $p_{1,k}(x^*) = [(f(b_k) - f(a_k))/(b_k - a_k)]\,(x^* - c_k)$ since $p_{1,k}(c_k) = 0$, we have

$$
-\frac{f(b_k) - f(a_k)}{b_k - a_k}\,(x^* - c_k) = \frac{f''(d_k)}{2}\,(x^* - a_k)(x^* - b_k).
$$

Let $e_j = x^* - a_j$ be the error in $a_j$. Then as $c_k = a_{k+1}$,

$$
-\frac{f(b_k) - f(a_k)}{b_k - a_k}\, e_{k+1} = \frac{1}{2} f''(d_k)\, e_k\, (x^* - b_k).
$$

Thus

$$
\frac{e_{k+1}}{e_k} = -\frac{f''(d_k)\,(b_k - a_k)}{2\,(f(b_k) - f(a_k))}\,(x^* - b_k)
\;\to\; -\frac{f''(\hat{d})\,(b_0 - x^*)}{2\,(f(b_0) - f(x^*))}\,(x^* - b_0)
\quad\text{as } k \to \infty,
$$

for some $\hat{d}$ between $x^*$ and $b_0$, demonstrating linear convergence.


3.4.3 Hybrid methods: Dekker’s and Brent’s methods 


The first method that truly combined the reliability of bisection with the speed of 
the secant method was Dekker’s method [74]. Dekker’s method was later eclipsed 
by Brent’s method [32, Chap. 4]. 

Dekker's method keeps an interval $[\min(a_k, b_k), \max(a_k, b_k)]$ so that $f(a_k)$ and $f(b_k)$ have opposite signs, but with $|f(b_k)| \le |f(a_k)|$. Unlike the bisection method or Regula Falsi, Dekker's method does not treat $a_k$ and $b_k$ symmetrically, but $b_k$ is meant to be the better approximation to the solution. Dekker's method uses the secant update on the $b_k$'s

$$
s_k \leftarrow b_k - \frac{f(b_k)\,(b_k - b_{k-1})}{f(b_k) - f(b_{k-1})},
$$

where applicable, but uses the midpoint $m_k = (a_k + b_k)/2$ where it is not applicable ($f(b_k) = f(b_{k-1})$). The choice then is to use either $s_k$ or $m_k$. If $s_k$ is strictly
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Algorithm 47 Dekker’s method 
 1  function dekker(f, a_0, b_0, ε)
 2    if sign f(a_0) ≠ −sign f(b_0)
 3      return fail
 4    end if
 5    k ← 0; b_{−1} ← a_0
 6    if |f(b_0)| > |f(a_0)|: swap a_0 and b_0
 7    while |f(b_k)| > ε
 8      m_k ← (a_k + b_k)/2
 9      if f(b_k) ≠ f(b_{k−1})
10        s_k ← b_k − f(b_k)(b_k − b_{k−1}) / (f(b_k) − f(b_{k−1}))
11      else
12        s_k ← m_k
13      end if
14      if m_k < s_k < b_k or b_k < s_k < m_k
15        b_{k+1} ← s_k
16      else
17        b_{k+1} ← m_k
18      end if
19      if sign f(b_{k+1}) = −sign f(a_k)
20        a_{k+1} ← a_k
21      else
22        a_{k+1} ← b_k
23      end if
24      if |f(b_{k+1})| > |f(a_{k+1})|: swap a_{k+1} & b_{k+1}; end if
25      k ← k + 1
26    end while
27    return b_k
28  end function

between $b_k$ and $m_k$ then we set $b_{k+1} \leftarrow s_k$, otherwise $b_{k+1} \leftarrow m_k$. This ensures that $b_{k+1} \in [\min(a_k, b_k), \max(a_k, b_k)]$, and also that $|b_{k+1} - b_k| \le \tfrac{1}{2} |a_k - b_k|$. To obtain $a_{k+1}$ we choose between $a_k$ and $b_k$: if $\operatorname{sign} f(b_{k+1}) = \operatorname{sign} f(b_k) = -\operatorname{sign} f(a_k)$ then we choose $a_{k+1} \leftarrow a_k$; if $\operatorname{sign} f(b_{k+1}) = -\operatorname{sign} f(b_k) = \operatorname{sign} f(a_k)$, we choose $a_{k+1} \leftarrow b_k$. This maintains the sign invariant: $\operatorname{sign} f(a_{k+1}) = -\operatorname{sign} f(b_{k+1})$. Finally, we swap $a_{k+1}$ and $b_{k+1}$ if $|f(b_{k+1})| > |f(a_{k+1})|$. This ensures that $|f(b_{k+1})| \le |f(a_{k+1})|$. The complete algorithm is shown in Algorithm 47.

Since $[\min(a_{k+1}, b_{k+1}), \max(a_{k+1}, b_{k+1})] \subseteq [\min(a_k, b_k), \max(a_k, b_k)]$ there is no possibility of the method "blowing up". The condition that $\operatorname{sign} f(a_k) = -\operatorname{sign} f(b_k)$ ensures that there is a zero of $f$ between $a_k$ and $b_k$, but does not guarantee that the number of iterations is better than bisection. It does, however, perform nearer to the secant method where the secant method converges rapidly. There are cases where the secant method converges slowly and Dekker's method exactly tracks the secant method.
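
The update rules just described can be sketched in Python as follows. This is an illustrative reading of Algorithm 47, not a reference implementation: the cap `max_iter` and the use of `ValueError` for a non-bracketing interval are assumptions of the sketch.

```python
import math


def dekker(f, a, b, eps, max_iter=200):
    """Dekker's method: secant steps on b, guarded by bisection."""
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    b_prev, fb_prev = a, fa                 # plays the role of b_{-1} = a_0
    if abs(fb) > abs(fa):
        a, b, fa, fb = b, a, fb, fa         # keep b as the better estimate
    for _ in range(max_iter):
        if abs(fb) <= eps:
            break
        m = (a + b) / 2
        if fb != fb_prev:
            s = b - fb * (b - b_prev) / (fb - fb_prev)   # secant step
        else:
            s = m
        # accept s only if it lies strictly between b and the midpoint
        b_new = s if min(b, m) < s < max(b, m) else m
        fb_new = f(b_new)
        if not (fb_new * fa < 0):           # sign change is now between b_new and old b
            a, fa = b, fb
        b_prev, fb_prev = b, fb
        b, fb = b_new, fb_new
        if abs(fb) > abs(fa):
            a, b, fa, fb = b, a, fb, fa
    return b


root = dekker(lambda x: math.exp(x) - x * x - 3, 1.0, 2.0, 1e-12)
print(root)
```

On $f(x) = e^x - x^2 - 3$ with $[1, 2]$ every secant estimate falls between $b_k$ and $m_k$, so the iterates coincide with those of the secant method in Table 3.4.1.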

Brent's method is an improvement on Dekker's method in several ways. Inverse quadratic interpolation is used based on the data points $(a_k, f(a_k))$, $(b_k, f(b_k))$, and $(b_{k-1}, f(b_{k-1}))$. Because it uses inverse quadratic interpolation, there is no need to solve a quadratic equation; however, it does require that $f(b_k) \neq f(b_{k-1})$ and $f(a_k) \neq f(b_{k-1})$. If any of these conditions fails, we resort to linear interpolation. The condition that $\operatorname{sign} f(a_k) = -\operatorname{sign} f(b_k)$ ensures that $f(a_k) \neq f(b_k)$. Because inverse quadratic interpolation is used, the condition for accepting the secant or inverse quadratic interpolation estimate has to be expanded a little: instead of requiring that the inverse quadratic interpolant estimate $s_k$ lies between $m_k = (a_k + b_k)/2$ and $b_k$, we only insist that the interpolant estimate $s_k$ lies between $(3a_k + b_k)/4$ and $b_k$. Brent's method aims to guarantee a halving of the interval width $|b_k - a_k|$ at least every two iterations. To do this, if the previous step was not a bisection step (as indicated by mflag = false), and $|s_k - b_k| \ge \tfrac{1}{2} |b_{k-1} - b_{k-2}|$ on line 16, then we do a bisection step.

Ensuring the reduction of the interval width $|b_k - a_k|$ by a fixed ratio in a fixed number of iterations guarantees that $|b_k - a_k| \to 0$ exponentially as $k \to \infty$. To see how rapid the convergence of inverse quadratic interpolation is, suppose that $P_{2,k}(y)$ is the quadratic interpolant of the data points $(f(a_k), a_k)$, $(f(b_k), b_k)$, and $(f(b_{k-1}), b_{k-1})$. Supposing that the inverse function $f^{-1}$ is well defined and smooth on $[\min(a_k, b_k, b_{k-1}), \max(a_k, b_k, b_{k-1})] \subseteq [\min(a_0, b_0), \max(a_0, b_0)]$, we have from the error formula for polynomial interpolation (4.1.7),

$$
f^{-1}(y) - P_{2,k}(y) = \frac{(f^{-1})'''(c_y)}{3!}\,(y - f(a_k))(y - f(b_k))(y - f(b_{k-1})), \quad\text{so}
$$

$$
f^{-1}(0) - P_{2,k}(0) = -\frac{1}{6}\,(f^{-1})'''(c_0)\, f(a_k)\, f(b_k)\, f(b_{k-1}).
$$

Now $f^{-1}(0) = x^*$, the solution we seek, and $P_{2,k}(0) = b_{k+1}$, so

$$
\begin{aligned}
b_{k+1} - x^* &= \frac{1}{6}\,(f^{-1})'''(c_0)\, f(a_k)\, f(b_k)\, f(b_{k-1}) \\
&= O\big((a_k - x^*)(b_k - x^*)(b_{k-1} - x^*)\big).
\end{aligned}
$$

Thus we can expect rapid convergence. We even get superlinear convergence without small $|a_k - x^*|$.


Exercises. 


(1) Use Regula Falsi to solve $e^x + x^2 = 4$ to 10 digits of accuracy. How many function evaluations does it use?

(2) Use Dekker's method to solve $e^x + x^2 = 4$ to 10 digits of accuracy. How many function evaluations does it use?

(3) Use Brent's method to solve $e^x + x^2 = 4$ to 10 digits of accuracy. How many function evaluations does it use?

(4) Use the three above methods to find the solution of $x = \tan x$ closest to $3\pi/2$. Give its value to 10 digits of accuracy. Report the number of function evaluations needed.
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Algorithm 48 Brent’s method 


 1  function brent(f, a_0, b_0, ε, δ)
 2    if sign f(a_0) ≠ −sign f(b_0)
 3      return fail
 4    end if
 5    k ← 0; b_{−1} ← a_0; mflag ← true
 6    if |f(b_0)| > |f(a_0)|: swap a_0 & b_0; end if
 7    while |f(b_k)| > ε and |b_k − a_k| > δ
 8      m_k ← (a_k + b_k)/2
 9      if f(b_k) ≠ f(b_{k−1}) and f(a_k) ≠ f(b_{k−1})
          // inverse quadratic interpolation
10        s_k ← a_k f(b_k) f(b_{k−1}) / [(f(a_k) − f(b_k))(f(a_k) − f(b_{k−1}))]
              + b_k f(a_k) f(b_{k−1}) / [(f(b_k) − f(a_k))(f(b_k) − f(b_{k−1}))]
              + b_{k−1} f(a_k) f(b_k) / [(f(b_{k−1}) − f(a_k))(f(b_{k−1}) − f(b_k))]
11      else // secant update
12        s_k ← b_k − f(b_k)(b_k − a_k) / (f(b_k) − f(a_k))
13      end if
14      if s_k not between b_k and (3a_k + b_k)/4 or
15         (mflag & [ |s_k − b_k| ≥ ½ |b_k − b_{k−1}| or δ > |b_k − b_{k−1}| ]) or
16         (not mflag & [ |s_k − b_k| ≥ ½ |b_{k−1} − b_{k−2}| or δ > |b_{k−1} − b_{k−2}| ])
17        s_k ← m_k
18        mflag ← true
19      else
20        mflag ← false
21      end if
22      if sign f(s_k) = −sign f(a_k)
23        a_{k+1} ← a_k; b_{k+1} ← s_k
24      else
25        b_{k+1} ← b_k; a_{k+1} ← s_k
26      end if
27      if |f(b_{k+1})| > |f(a_{k+1})|: swap a_{k+1} & b_{k+1}; end if
28      k ← k + 1
29    end while
30    return b_k
31  end function


(5) Suppose a modified hybrid method alternates between bisection updates ($c_k \leftarrow (a_k + b_k)/2$) for even $k$ and secant updates ($c_k \leftarrow a_k - f(a_k)(b_k - a_k)/(f(b_k) - f(a_k))$) for odd $k$, and $a_k \leftarrow c_k$ if $\operatorname{sign} f(c_k) = \operatorname{sign} f(a_k)$ and $b_k \leftarrow c_k$ otherwise. Show that $|b_{k+2} - a_{k+2}| / |b_k - a_k| \to \tfrac{1}{2}$ as $k \to \infty$. Explain why this limits the performance of the hybrid method.
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(6) 


Implement a hybrid method using the following update code:


Cea — f (ax) (bea) /(f (be) — f(a) // secant estimate 
if G between aq and ay + qk: ce <— 2¢, —ae; end if 
if G between bk, and be + 5A: Ce <— 2¢, — by; end if 
if sign f (cy) = signf (ap): ag <— cx 

else: by <— cx. 


Use this for solving e* + x” = 4. Do the secant estimates ¢ have superlinear 
convergence to the exact solution? 

(7) Are there conditions in which inverse quadratic interpolation can fail or cause numerical problems? Explain.

(8) Apply Regula Falsi, Dekker's method, and Brent's method to solving $x^2 \sin(1/x) = 0$ in the interval $[-3/(2\pi), 2/\pi]$. Do they find the same solution? Compare with the results from the bisection method.

(9) Trust region optimization methods often have to solve the equation $\|(B + \lambda I)^{-1} g\|_2 = \Delta$ for $\lambda$ given $B$ symmetric $n \times n$ and $g \in \mathbb{R}^n$. Trust region methods also need $B + \lambda I$ positive semi-definite (so that $\lambda + \lambda_{\min}(B) \ge 0$). Find an interval $[a, b]$ for $\lambda$ where $a, b \ge -\lambda_{\min}(B)$ and $\|(B + \lambda I)^{-1} g\|_2 - \Delta$ has opposite signs at $\lambda = a$ and $\lambda = b$. To find $a$ and $b$, you should use only the quantities $\lambda_{\min}(B)$, $\lambda_{\max}(B)$, and $\|g\|_2$. Adapt a hybrid method of your choice to solve this problem. You should assume that there is a function isposdef(A) that returns true if matrix $A$ is positive definite and false otherwise.

It is possible to use bisection and hybrid methods in two dimensions in 
certain circumstances. Consider a rectangle R = [a, b] x [c,d] and func- 
tions f, g: R? > R with the properties f(a, y) < 0 and f(b, y) > 0 for all 
y € [c,d], and g(x,c) <0 and g(x,d) > 0 for all x € [a,b]. In addition, 
assume that f(x, y) is a strictly increasing function of y so that for each x 
there is one and only one solution y = (x) of f(x, y) = 0. See Figure 3.4.1 


! f(z, d) >0 
y=d : 
\ 
rn : G(z,y) =0 
nae \ oe ion 
g(a,y) <0 ys ee g(b,y) > 0 
\ wr a 
\. SO 
f(z, y) =0 2 ae ae 
y=c 
Lr=a f(a,c) <0 xc=b 


Fig. 3.4.1 A special case where bisection and hybrid methods can solve a two-dimensional problem 
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for an illustration. Implement a solver based on nested bisection or hybrid 
methods. The inner method computes y = ¥(x) that solves f(x, y) = 0 for 
a given x by solving for y. The outer method solves g(x, ¥(x)) = 0 using a 
hybrid method solving for x. Test this on the rectangle [0, 1] x [0, 1] for func- 
tions f(x, y) =e* — x? +e%(1+x)— land g(x, y) = sin(ry/2) —x + xy. 
Argue that without the requirement that f(x, y) be increasing in y this method
can fail.


3.5 Continuation methods 


When faced with a highly nonlinear system of equations, so far we have only the multivariate Newton’s method or one of its variants. Ensuring convergence, even assuming $\nabla f(x^*)$ is invertible, requires an initial guess $x_0$ that is “close enough” to the true solution $x^*$. In the absence of good leads as to where $x^*$ is, we can pick $x_0$ “at random”. Hopefully, if we try this enough times, we will chance on an $x_0$ that is “close enough” to $x^*$. But if the dimension is fairly large, the chance of this happening can be extremely small.

Here is another approach that can work in fairly general situations. The approach we develop here is fleshed out in more detail in [4]. We start with the system of equations we want to solve: $f(x) = 0$ where $f$ is a smooth function $\mathbb{R}^n \to \mathbb{R}^n$. We also choose an easier system of equations: $g(x) = 0$. The function $g$ should also be smooth with an easily determined solution. We could use $g(x) = x - a$ for a suitable, or even randomly chosen, $a$. We then connect the easy problem to the hard problem with a homotopy: $h(x, t) = 0$. We choose $h\colon \mathbb{R}^n \times [0, 1] \to \mathbb{R}^n$ so that $h(x, 0) = g(x)$ for all $x$ (the “easy” function) and $h(x, 1) = f(x)$ for all $x$ (the “hard” function). We start from the solution $x_0$ of $h(x, 0) = 0$. We then follow the path

\[ C := \{\, (x, t) \in \mathbb{R}^n \times [0, 1] \mid h(x, t) = 0 \,\} \]

from $(x_0, 0)$ to $(x_1, 1)$. Then $x_1$ is a solution of the original problem, since $f(x_1) = h(x_1, 1) = 0$.

Methods that follow a path of solutions for a homotopy like this are called con- 
tinuation or homotopy methods. We need to identify situations in which this can be 
done in principle. We also need to fill in the details of the methods to achieve this 
and answer the questions about what can go wrong. 


3.5.1 Following paths 


Figure 3.5.1 illustrates the kind of paths that can arise in a homotopy.

Fig. 3.5.1 Illustration of homotopy paths

The implicit function theorem of multivariate calculus can be used to identify properties of the set

\[ C := \{\, (x, t) \in \mathbb{R}^n \times [0, 1] \mid h(x, t) = 0 \,\}. \]

Specifically, if $\operatorname{rank}[\nabla_x h(x, t),\ \partial h/\partial t(x, t)] = n$ for all $(x, t) \in C$, then $C$ is a one-dimensional manifold, possibly with boundary [114]. There are only two essentially different, connected, compact one-dimensional manifolds with or without boundary: an interval $[a, b]$ or a circle. Provided $C$ is compact, the connected component containing $(x_0, 0)$ is therefore a continuous image of an interval $[a, b]$ or the continuous image of a circle. If $g(x) = 0$ has only one solution $x = x_0$ and $\nabla g(x_0)$ is invertible, then the only possibility is the image of an interval. The other end of that image of $[a, b]$ must have either $t = 0$ or $t = 1$. But if $x = x_0$ is the only solution of $g(x) = 0$, then it is not possible for the other end point to be at $t = 0$. Thus, the other end point would have to be at $t = 1$, which is a point $(x^*, 1)$ where $f(x^*) = 0$.

The condition that $C$ is compact means that the curve of $(x, t)$ points cannot “run off to infinity”. This depends on the homotopy that is chosen and the function $f$. One of the simplest conditions that can ensure this is the following:

\[ x^T f(x) > 0 \quad \text{for all } \|x\|_2 = R. \tag{3.5.1} \]

If we then choose $g(x) = x - a$ for some $a$ with $\|a\|_2 < R$, and the homotopy $h(x, t) = t\, f(x) + (1 - t)\, g(x)$, we find that for any $x$ with $\|x\|_2 = R$,

\[ x^T h(x, t) = t\, x^T f(x) + (1 - t)\, x^T g(x) > 0. \]

It is clear from this that $h(x, t) = 0$ is impossible for $0 \le t \le 1$ and $\|x\|_2 = R$. Thus the set $C$ does not intersect $\{\, (x, t) \mid \|x\|_2 = R,\ 0 \le t \le 1 \,\}$. That is, the homotopy curve cannot escape the set $\{\, x \mid \|x\|_2 < R \,\} \times [0, 1]$. Then $C$ is bounded. Since it is also closed in $\mathbb{R}^{n+1}$, it is compact. Because it is compact, the component containing $(x_0, 0)$ ends at $(x_1, 1)$ and we have a solution.
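The strict inequality in the boundary argument can be checked in one line. As a sketch, assuming $g(x) = x - a$ with $\|a\|_2 < R$ and condition (3.5.1), the Cauchy-Schwarz inequality gives, for $\|x\|_2 = R$ and $0 \le t \le 1$:

```latex
\begin{align*}
x^T g(x) &= x^T x - x^T a \;\ge\; R^2 - R\,\|a\|_2 \;=\; R\,(R - \|a\|_2) \;>\; 0, \\
x^T h(x,t) &= \underbrace{t\, x^T f(x)}_{\ge 0,\ >0 \text{ if } t>0}
            \;+\; \underbrace{(1-t)\, x^T g(x)}_{\ge 0,\ >0 \text{ if } t<1} \;>\; 0 .
\end{align*}
```

Both terms are nonnegative, and for every $t \in [0, 1]$ at least one of them is strictly positive, so the sum cannot vanish on the sphere $\|x\|_2 = R$.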

This happy ending relies on an assumption we have made: that $\operatorname{rank}[\nabla_x h(x, t),\ \partial h/\partial t(x, t)] = n$ for all $(x, t) \in C$. This is not guaranteed. However, we can make it likely by incorporating an additional variable $a \in \mathbb{R}^n$ and making the homotopy $h(x, t; a)$ dependent on $a$. Choosing $a$ “at random” makes it likely that $\operatorname{rank}[\nabla_x h(x, t; a),\ \partial h/\partial t(x, t; a)] = n$ for all $(x, t) \in C_a$ where

\[ C_a = \{\, (x, t) \in \mathbb{R}^n \times [0, 1] \mid h(x, t; a) = 0 \,\}. \]

The key to showing that choosing $a$ “at random” makes it likely (in fact, with probability one) that $\operatorname{rank}[\nabla_x h(x, t; a),\ \partial h/\partial t(x, t; a)] = n$ for all $(x, t) \in C_a$ is the Morse-Sard Theorem:


Theorem 3.12 (Morse-Sard theorem) If $F\colon \mathbb{R}^m \to \mathbb{R}^p$ has all partial derivatives of order $\le r$ continuous, then provided $r \ge 1 + \max(m - p, 0)$, the set

\[ \{\, F(x) \mid x \in \mathbb{R}^m \ \&\ \operatorname{rank} \nabla F(x) < p \,\} \]

has zero Lebesgue measure in $\mathbb{R}^p$.

A proof can be found in either [227] or [92]. An immediate corollary is


Corollary 3.13 If $h\colon \mathbb{R}^n \times [0, 1] \to \mathbb{R}^n$ has continuous second derivatives then the set of $y \in \mathbb{R}^n$ for which $\operatorname{rank}[\nabla_x h(x, t),\ \partial h/\partial t(x, t)] < n$ for some $(x, t)$ where $h(x, t) = y$ has measure zero.

Thus, perturbing the homotopy to $h(x, t) = y$ with small but random $y$ will give the conditions under which the paths in $C_y = \{\, (x, t) \mid h(x, t) = y \,\}$ can be followed. A more useful result is the following parameterized version [50]:


Theorem 3.14 If $h\colon \mathbb{R}^n \times [0, 1] \times \mathbb{R}^n \to \mathbb{R}^n$ has continuous second derivatives where $\nabla_a h(x, t; a)$ is invertible for all $a$, then the set of $a \in \mathbb{R}^n$ for which $\operatorname{rank}[\nabla_x h(x, t; a),\ \partial h/\partial t(x, t; a)] < n$ for some $(x, t) \in C_a$ has measure zero.

Theorem 3.14 can be applied, for example, to

\[ h(x, t; a) = t\, f(x) + (1 - t)\,(x - a). \]

For this function $\nabla_a h(x, t; a) = (t - 1) I$, which is invertible for $t \ne 1$. If $t = 1$ then $h(x, 1; a) = f(x)$, which depends completely on our “hard” function $f$. If $\nabla f(x_1)$ is invertible, then we can follow the path all the way to the solution. If $\nabla f(x_1)$ is not invertible, then the situation is more complex, but the issue lies with $f$, not with following paths.

Algorithm 49 Simple continuation algorithm
1 function continI(h, ∇_x h, x_0, ε, N)
2   for i = 1, 2, ..., N
3     x_i ← newton(x ↦ h(x, i/N), x ↦ ∇_x h(x, i/N), x_{i−1}, ε)
4   end
5   return x_N
6 end


3.5.2. Numerical methods to follow paths 


The simplest way to follow the path $h(x, t) = 0$ is to subdivide the interval $[0, 1]$ into pieces $[t_i, t_{i+1}]$ with $t_i = i/N$ for $i = 0, 1, 2, \ldots, N$. At each step we use the solution $x_i$ of $h(x, t_i) = 0$ as the starting point for Newton’s method for solving $h(x, t_{i+1}) = 0$ for $x$. Algorithm 49 shows a straightforward implementation of this idea. This uses, for example, the Newton’s method code in Algorithm 43. In a good implementation, a guarded Newton method (see Algorithm 44) should be used instead to make the method more reliable.
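The stepping idea can be tried on a scalar problem. The following is a minimal sketch, not the book's code: the unguarded `newton` helper, the test function `f(x) = x^3 + x - 3`, and the starting value `a = 0` are all assumptions chosen for illustration. Because this `f` is increasing, the path never bends back, so stepping in `t` is safe here.

```python
def newton(F, dF, x, eps, max_iter=50):
    """Plain (unguarded) scalar Newton iteration; stops when |F(x)| <= eps."""
    for _ in range(max_iter):
        fx = F(x)
        if abs(fx) <= eps:
            return x
        x = x - fx / dF(x)
    raise RuntimeError("Newton's method did not converge")


def simple_continuation(h, dh_dx, x0, eps, N):
    """Follow h(x, t) = 0 from t = 0 to t = 1 in N equal steps in t."""
    x = x0
    for i in range(1, N + 1):
        t = i / N
        # The solution at the previous t is the starting guess at the new t.
        x = newton(lambda z: h(z, t), lambda z: dh_dx(z, t), x, eps)
    return x


# Hypothetical test problem: "hard" f, "easy" g(x) = x - a, convex homotopy.
def f(x):
    return x**3 + x - 3.0


def df(x):
    return 3.0 * x**2 + 1.0


a = 0.0


def h(x, t):
    return t * f(x) + (1.0 - t) * (x - a)


def dh_dx(x, t):
    return t * df(x) + (1.0 - t)


root = simple_continuation(h, dh_dx, a, 1e-12, 10)
print(root, f(root))
```

For systems, the scalar division in `newton` would be replaced by a linear solve with the Jacobian matrix.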

This method can handle certain problems well, but not others. In particular, if the path being followed “bends back”, then Algorithm 49 will fail in the sense of losing the path being followed. In some problems, the path does not “bend back”. In these cases, Algorithm 49 can work.

To understand the more general case, we suppose that $[\nabla_x h(x, t),\ \partial h/\partial t(x, t)]$ has rank $n$. This matrix is $n \times (n + 1)$. We can take the QR factorization of its transpose:

\[ [\nabla_x h(x, t),\ \partial h/\partial t(x, t)]^T = QR = [Q_1,\ q_{n+1}] \begin{bmatrix} R_1 \\ 0^T \end{bmatrix} = Q_1 R_1. \]

Note that under the assumption that $[\nabla_x h(x, t),\ \partial h/\partial t(x, t)]$ has rank $n$, $R_1$ is invertible and range $Q_1$ is the range of $[\nabla_x h(x, t),\ \partial h/\partial t(x, t)]^T$. The vector $q_{n+1}$ is therefore orthogonal to the range of $[\nabla_x h(x, t),\ \partial h/\partial t(x, t)]^T$: $q_{n+1}^T\, [\nabla_x h(x, t),\ \partial h/\partial t(x, t)]^T = 0$, or equivalently $[\nabla_x h(x, t),\ \partial h/\partial t(x, t)]\, q_{n+1} = 0$. Writing $q_{n+1}^T = [v^T, w]$ we see that

\[ \nabla_x h(x, t)\, v + \partial h/\partial t(x, t)\, w = 0. \]
On the other hand, if we parameterize the path by $x = x(s)$ and $t = t(s)$, differentiating $h(x(s), t(s)) = 0$ gives

\[ 0 = \frac{d}{ds} h(x(s), t(s)) = \nabla_x h(x(s), t(s))\, \frac{dx}{ds}(s) + \frac{\partial h}{\partial t}(x(s), t(s))\, \frac{dt}{ds}(s) = [\nabla_x h(x, t),\ \partial h/\partial t(x, t)] \begin{bmatrix} dx/ds \\ dt/ds \end{bmatrix}, \]

so $[dx/ds^T, dt/ds]^T$ is in the null space of $[\nabla_x h(x, t),\ \partial h/\partial t(x, t)]$. Because the rank of $[\nabla_x h(x, t),\ \partial h/\partial t(x, t)]$ is $n$, the dimension of the null space is one. Thus $[dx/ds^T, dt/ds]$ is a multiple of $q_{n+1}^T = [v^T, w]$. Furthermore, if we parameterize the path by its arc length in $(x, t)$, then $\|[dx/ds^T, dt/ds]\|_2 = 1 = \|q_{n+1}\|_2 = \|[v^T, w]\|_2$. This leaves us with only the choice of sign to be made: $dx/ds = \pm v$ and $dt/ds = \pm w$ (same sign in both equations). The choice of sign determines whether we move forward in the path, or backward. Initially we must move forward. Setting $x(0) = x_0$ and $t(0) = 0$, we need $dt/ds(0) > 0$. However, if the path bends back, then we will have part of the curve where $dt/ds < 0$. There also must be a point $(x(\hat{s}), t(\hat{s}))$ where $dt/ds(\hat{s}) = 0$. This gives

\[ \nabla_x h(x(\hat{s}), t(\hat{s}))\, \frac{dx}{ds}(\hat{s}) = 0. \]

But $dx/ds(\hat{s}) \ne 0$ since otherwise $\|[dx/ds^T, dt/ds]\|_2 = 0$, which contradicts the use of arc length for parameterizing the curve. Thus, $\nabla_x h(x(\hat{s}), t(\hat{s}))$ is not invertible. This makes Newton’s method dangerous to use, or at least slow, near $(x(\hat{s}), t(\hat{s}))$.
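For a scalar unknown the null vector can be written down without a QR factorization, which makes turning points easy to see. The sketch below uses a hypothetical S-shaped homotopy `h(x, t) = x^3 - x + 3 - 6t`, chosen only for illustration; the helper names are assumptions as well. The null space of the 1 x 2 matrix `[h_x, h_t]` is spanned by `(-h_t, h_x)`, so `dt/ds` vanishes exactly where `h_x` does.

```python
import math


# Hypothetical S-shaped homotopy (illustration only).
def h(x, t):
    return x**3 - x + 3.0 - 6.0 * t


def h_x(x, t):
    return 3.0 * x**2 - 1.0


def h_t(x, t):
    return -6.0


def unit_tangent(x, t):
    """Unit vector (v, w) with h_x * v + h_t * w = 0, for scalar x."""
    v, w = -h_t(x, t), h_x(x, t)  # spans the null space of [h_x, h_t]
    norm = math.hypot(v, w)
    return v / norm, w / norm


def t_on_curve(x):
    """Solve h(x, t) = 0 for t; possible here because h is linear in t."""
    return (x**3 - x + 3.0) / 6.0


for x in (-1.5, -1.0 / math.sqrt(3.0), 0.0, 1.0 / math.sqrt(3.0), 1.5):
    t = t_on_curve(x)
    v, w = unit_tangent(x, t)
    print(f"x={x:+.4f} t={t:.4f} dx/ds={v:+.4f} dt/ds={w:+.4f} h_x={h_x(x, t):+.4f}")
```

Between the two points where `h_x = 0` the printed `dt/ds` is negative: the curve runs backward in `t` there, which is where stepping in `t` alone loses the path.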

Computationally there are two challenges we have identified: how to solve $h(x, t) = 0$ when we are close to the curve, but not through the ordinary Newton’s method in $x$ alone, and how to ensure that we keep moving “forward” on the path even where $dt/ds < 0$.

For the first task, we note that $h(x, t) = 0$ is an under-determined problem, as there are $n + 1$ unknowns in $(x, t)$ and only $n$ equations. Given a point $(x, t)$ we wish to find a correction $(\delta x, \delta t)$ that solves the linearization

\[ h(x, t) + \nabla_x h(x, t)\, \delta x + \frac{\partial h}{\partial t}(x, t)\, \delta t = 0. \]

This is an under-determined linear system of equations in $(\delta x, \delta t)$. It can be solved using the QR factorization of $[\nabla_x h(x, t),\ \partial h/\partial t(x, t)]^T$. In general, suppose that $A$ is $p \times q$ with $p < q$ so that $Az = b$ is an under-determined linear system. Take the QR factorization

\[ A^T = QR = [Q_1,\ Q_2] \begin{bmatrix} R_1 \\ 0 \end{bmatrix} = Q_1 R_1. \]

Then $A = R_1^T Q_1^T$. As long as $A$ has rank $p$, $R_1$ is invertible. Then $Az = b$ is equivalent to $R_1^T Q_1^T z = b$ and so $Q_1^T z = R_1^{-T} b$. Writing $z = Qu = Q_1 u_1 + Q_2 u_2$ we see that $R_1^{-T} b = Q_1^T (Q_1 u_1 + Q_2 u_2) = u_1$ and $u_2$ is free. If we set $u_2 = 0$ then $z = Q_1 u_1 = Q_1 R_1^{-T} b$. Note that this is not just any solution to $Az = b$: since

\[ \|z\|_2 = \|Qu\|_2 = \|u\|_2 = \sqrt{\|u_1\|_2^2 + \|u_2\|_2^2} \ge \|u_1\|_2, \]

this choice $z = Q_1 u_1 = Q_1 R_1^{-T} b$ is the smallest solution in the 2-norm. Applied to our linearized equation for $(\delta x, \delta t)$ we see that this approach gives us the smallest value of $\|(\delta x, \delta t)\|_2$ that solves the equation. The updates $x^+ \leftarrow x + \delta x$, $t^+ \leftarrow t + \delta t$ give the nearest update $(x^+, t^+)$ to the original $(x, t)$ that satisfies the linearized equations. Repeating these corrections in a Newton or guarded Newton way gives us a way to solve under-determined systems of equations. Often we leave out updating the Jacobian matrix $[\nabla_x h(x, t),\ \partial h/\partial t(x, t)]$ after updates to $(x, t)$ if the initial guess is close to a solution. This avoids the expense of the QR factorization at each iteration, while still giving a good (if linear) rate of convergence per iteration. An implementation that recomputes and refactors the Jacobian matrix at each iteration is shown in Algorithm 50.
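The minimum-norm construction $z = Q_1 R_1^{-T} b$ can be sketched in a few lines. This is an illustrative sketch only: it assumes $A$ has full row rank, stores matrices as lists of rows, and uses modified Gram-Schmidt for the thin QR factorization purely to stay self-contained. A production code would use Householder reflectors from a linear algebra library. The function names are hypothetical.

```python
import math


def thin_qr(M):
    """Modified Gram-Schmidt: M (tall, full column rank) = Q1 R1.

    Returns the columns of Q1 as a list of vectors, and R1 as a list of rows.
    """
    rows, cols = len(M), len(M[0])
    Q = [[M[i][j] for i in range(rows)] for j in range(cols)]  # columns of M
    R = [[0.0] * cols for _ in range(cols)]
    for j in range(cols):
        for k in range(j):
            R[k][j] = sum(Q[k][i] * Q[j][i] for i in range(rows))
            Q[j] = [Q[j][i] - R[k][j] * Q[k][i] for i in range(rows)]
        R[j][j] = math.sqrt(sum(c * c for c in Q[j]))
        Q[j] = [c / R[j][j] for c in Q[j]]
    return Q, R


def min_norm_solution(A, b):
    """Smallest 2-norm z with A z = b, for A (p x q, p < q) of full row rank."""
    p, q = len(A), len(A[0])
    At = [[A[i][j] for i in range(p)] for j in range(q)]  # transpose, q x p
    Q1, R1 = thin_qr(At)
    # Solve R1^T u = b by forward substitution (R1^T is lower triangular).
    u = [0.0] * p
    for i in range(p):
        s = b[i] - sum(R1[k][i] * u[k] for k in range(i))
        u[i] = s / R1[i][i]
    # z = Q1 u, a combination of the orthonormal columns.
    return [sum(Q1[k][i] * u[k] for k in range(p)) for i in range(q)]


A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
b = [1.0, 2.0]
z = min_norm_solution(A, b)
residual = [sum(A[i][j] * z[j] for j in range(3)) - b[i] for i in range(2)]
print(z, residual)
```

In the corrector step of a path follower, `A` would be the Jacobian $[\nabla_x h,\ \partial h/\partial t]$ and `b` would be $-h(x, t)$.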

The QR factorization is also key in the second computational task: finding which direction to continue along, $+q_{n+1}$ or $-q_{n+1}$. Using

\[ [\nabla_x h(x, t),\ \partial h/\partial t(x, t)]^T = QR = [Q_1,\ q_{n+1}] \begin{bmatrix} R_1 \\ 0^T \end{bmatrix}, \]

we can add an extra column to get (recall that $q_{n+1} = [v^T, w]^T$)

\[ \begin{bmatrix} \nabla_x h(x, t)^T & v \\ \partial h/\partial t(x, t)^T & w \end{bmatrix} = [Q_1,\ q_{n+1}] \begin{bmatrix} R_1 & 0 \\ 0^T & 1 \end{bmatrix}. \]

The right-hand side is invertible since $Q = [Q_1,\ q_{n+1}]$ is orthogonal, and $\begin{bmatrix} R_1 & 0 \\ 0^T & 1 \end{bmatrix}$ is invertible as $R_1$ is invertible. Note that $v = \pm dx/ds$ and $w = \pm dt/ds$ (same choice of sign in both equalities). Thus,

\[ \det \begin{bmatrix} \nabla_x h(x, t)^T & v \\ \partial h/\partial t(x, t)^T & w \end{bmatrix} = \pm \det \begin{bmatrix} \nabla_x h(x, t)^T & dx/ds \\ \partial h/\partial t(x, t)^T & dt/ds \end{bmatrix} \ne 0. \]

Assuming all quantities are continuously differentiable,

\[ \det \begin{bmatrix} \nabla_x h(x, t)^T & dx/ds \\ \partial h/\partial t(x, t)^T & dt/ds \end{bmatrix} \]

is continuous, real, and never zero. Thus its sign must remain constant. We can compute the sign $\sigma_0$ of this determinant at $(x, t) = (x_0, 0)$, so we can use the sign of $\det \begin{bmatrix} \nabla_x h(x, t)^T & v \\ \partial h/\partial t(x, t)^T & w \end{bmatrix}$ to determine the choice of sign for $v = \pm dx/ds$ and $w = \pm dt/ds$.

The sign of $\det Q$, where $Q$ is the orthogonal factor of a QR factorization, can usually be computed easily. If $Q$ is, for example, a product of Householder reflectors, each Householder reflector contributes a factor of $(-1)$ to the determinant. The sign of the determinant of $R_1$ is determined by the parity of the number of negative diagonal elements. Then $v = \sigma\, dx/ds$ and $w = \sigma\, dt/ds$ where $\sigma = \sigma_0\, \operatorname{sign}(\det Q)\, \operatorname{sign}(\det R_1)$.

An alternative approach is to keep the directions $[dx/ds^T, dt/ds]$ consistent as $s$ changes. More specifically, if $s \approx s'$ then we expect

\[ \begin{bmatrix} dx/ds(s) \\ dt/ds(s) \end{bmatrix} \approx \begin{bmatrix} dx/ds(s') \\ dt/ds(s') \end{bmatrix} \quad \text{so} \quad \begin{bmatrix} dx/ds(s) \\ dt/ds(s) \end{bmatrix}^T \begin{bmatrix} dx/ds(s') \\ dt/ds(s') \end{bmatrix} > 0. \]

Algorithm 50 Guarded Newton method for under-determined systems
1 function gudnewton(f, ∇f, x_0, ε)
2   k ← 0
3   while ‖f(x_k)‖_2 > ε
4     A ← ∇f(x_k)^T; A = QR   // QR factorization of the transposed Jacobian
5     split Q = [Q_1, Q_2]; R = [R_1; 0] so A = Q_1 R_1
6     d_k ← −Q_1 R_1^{−T} f(x_k)
7     α ← 1
8     while ‖f(x_k + α d_k)‖_2 > (1 − ½α) ‖f(x_k)‖_2
9       α ← α/2
10    end while
11    x_{k+1} ← x_k + α d_k; k ← k + 1
12  end while
13  return x_k
14 end function


Algorithm 51 Homotopy algorithm (uses Algorithm 50)
1 function homotopy(h, ∇h, x_0, δs, ε, η)
2   t ← 0; s ← 0; x ← x_0   // assume h(x_0, 0) = 0
3   oldq ← 0
4   A ← ∇h(x, t)^T; A = QR   // QR factorization of the transposed Jacobian
5   split Q = [Q_1, q], R = [R_1; 0^T]   // q = [v^T, w]^T
6   if q_{n+1} < 0: q ← −q
7   oldq ← q
8   while t < 1
9     if t + δs w > 1: δs ← (1 − t)/w; end if
10    x⁺ ← x + δs v; t⁺ ← t + δs w
11    A ← ∇h(x⁺, t⁺)^T; A = QR   // QR factorization
12    split Q = [Q_1, q], R = [R_1; 0^T]
13    if q^T oldq < 0: q ← −q; end if
14    if ∠(q, oldq) > η: δs ← δs/2; continue
15    (x⁺, t⁺) ← gudnewton(h, ∇h, (x⁺, t⁺), ε)
16    (x, t) ← (x⁺, t⁺)
17    oldq ← q
18  end while
19  x ← newton(f, ∇f, x, ε)
20  return x
21 end


This gives a simpler approach to controlling how quickly to increment $s_k$ to $s_{k+1}$: ensure that the angle between $[dx/ds(s_k)^T, dt/ds(s_k)]$ and $[dx/ds(s_{k+1})^T, dt/ds(s_{k+1})]$ never exceeds a small user-specified threshold $\eta > 0$. If the threshold is exceeded, go back to $s_k$, reduce the increment $\delta s$ (say by halving it), then set $s_{k+1} \leftarrow s_k + \delta s$ and try again with the new value of $s_{k+1}$.

In Algorithm 51, $\nabla h(x, t) = [\nabla_x h(x, t),\ \partial h/\partial t(x, t)]$.

Details on how to implement homotopy algorithms can be found in the book by Allgower and Georg [4]. The package HOMPACK [257] developed by Watson et al. implements continuation algorithms.
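To make the predictor-corrector idea concrete, here is a minimal sketch for a scalar unknown. It is not Algorithm 51 itself: the test homotopy `h(x, t) = x^3 - x + 3 - 6t` is a hypothetical example with two turning points, the tangent is computed in closed form instead of by QR, and the angle test is applied after the corrector rather than before it. All names and default parameters are assumptions.

```python
import math


# Hypothetical scalar test homotopy with two turning points (illustration only).
def h(x, t):
    return x**3 - x + 3.0 - 6.0 * t


def grad_h(x, t):
    return (3.0 * x**2 - 1.0, -6.0)  # (dh/dx, dh/dt)


def tangent(x, t, old):
    """Unit null vector of [h_x, h_t], oriented consistently with `old`."""
    hx, ht = grad_h(x, t)
    v, w = -ht, hx
    norm = math.hypot(v, w)
    v, w = v / norm, w / norm
    if v * old[0] + w * old[1] < 0.0:
        v, w = -v, -w
    return v, w


def corrector(x, t, eps, max_iter=20):
    """Minimum-norm Newton steps back onto the curve h(x, t) = 0."""
    for _ in range(max_iter):
        r = h(x, t)
        if abs(r) <= eps:
            return x, t
        hx, ht = grad_h(x, t)
        scale = r / (hx * hx + ht * ht)
        x, t = x - scale * hx, t - scale * ht
    raise RuntimeError("corrector did not converge")


def follow_path(x0, ds=0.05, eps=1e-12, eta=0.3, max_steps=10000):
    x, t = x0, 0.0
    q = tangent(x, t, (0.0, 1.0))  # start with dt/ds > 0
    path = [(x, t)]
    for _ in range(max_steps):
        if t >= 1.0 - 1e-12:
            break
        step = ds
        while True:
            v, w = q
            if w > 0.0 and t + step * w > 1.0:
                step = (1.0 - t) / w  # do not step past t = 1
            xp, tp = corrector(x + step * v, t + step * w, eps)
            qp = tangent(xp, tp, q)
            cosine = max(-1.0, min(1.0, qp[0] * q[0] + qp[1] * q[1]))
            if math.acos(cosine) <= eta:
                break
            step *= 0.5  # direction changed too quickly: retry with a shorter step
        x, t, q = xp, tp, qp
        path.append((x, t))
    for _ in range(20):  # final polish at t = 1 with ordinary Newton in x
        x -= h(x, 1.0) / grad_h(x, 1.0)[0]
    return x, path


x0 = -1.5
for _ in range(20):  # Newton for the starting point on the curve at t = 0
    x0 -= h(x0, 0.0) / grad_h(x0, 0.0)[0]
root, path = follow_path(x0)
print(root, len(path))
```

Because the step is taken along the arc of the curve in $(x, t)$ rather than in $t$, the path is followed through both turning points, including the stretch where $t$ decreases.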


Exercises. 


(1) Brouwer’s fixed-point theorem states that if $B = \{\, x \in \mathbb{R}^n \mid \|x\|_2 \le 1 \,\}$ and $f\colon \mathbb{R}^n \to B$ is continuous, then there is a fixed point $x^* \in B$ where $f(x^*) = x^*$. Show that this is true for $f$ smooth. [Hint: Let $h(t, x, a) = x - [(1 - t)a + t\, f(x)]$. Then provided $t \ne 1$, $\nabla_a h(t, x, a) = (t - 1) I$ is invertible. Then by Theorem 3.14, for almost all $a$ in the interior of $B$, the solution to $h(t, x, a) = 0$ is a union of smooth closed curves. Starting at $x = a$ for $t = 0$ we continue on the curve $C := \{\, (x, t) \mid h(t, x, a) = 0 \,\}$ until we hit a boundary: either $x \in \partial B$ or $t = 0$ or $t = 1$. We cannot hit $x \in \partial B$ because $\|a\|_2 < 1$. Show that we cannot hit $t = 0$ because this would mean $C$ is tangent to $t = 0$ at $(0, a)$. Thus we approach $t = 1$. Use compactness of $B$ to show that there is a limit point $(1, x^*)$ where $x^*$ is a fixed point of $f$.]
(2) Suppose that $f\colon \mathbb{C} \to \mathbb{C}$ is an analytic function so that $df/dz(z)$ always exists as a complex number for all $z \in \mathbb{C}$. Treat $f$ as a function $\mathbb{R}^2 \to \mathbb{R}^2$, with $f(z) = u(x, y) + i\, v(x, y)$ where $z = x + iy$ and $u, v, x, y$ are all real. Show that $\det \nabla f(x, y) \ge 0$ with equality only if $f'(z) = 0$. For a homotopy $h(t, z) = t\, f(z) + (1 - t)\, p(z)$, where both $f$ and $p$ are analytic, show that the homotopy curve $\{\, (t, z) \mid h(t, z) = 0 \,\}$ with random choice of $p(z)$ does not “turn back”. [Hint: If we write $(t, z) = (t(s), z(s))$ as smooth functions of the arclength $s$, then $dt/ds = 0$ implies that $\partial h/\partial z(t, z) = 0$. Use the parameterized Sard theorem to show that with probability one $\partial h/\partial z(t, z) \ne 0$ on the homotopy curve.]
(3) We can use homotopy methods to find all complex roots of a polynomial: if $f(z)$ is a polynomial of degree $d$ with leading coefficient one, set $p(z) = \prod_{k=1}^{d} (z - r_k)$ with randomly chosen $r_k \in \mathbb{C}$. Show that the homotopy curve

\[ \{\, (t, z) \mid t\, f(z) + (1 - t)\, p(z) = 0,\ 0 \le t \le 1 \,\} \]

is bounded. [Hint: The coefficient of $z^d$ in $t\, f(z) + (1 - t)\, p(z)$ is one. The other coefficients for $0 \le t \le 1$ are bounded.]

(4) Use HOMPACK or some other implementation of homotopy methods to solve 
f(x) = 0 where 


ety —cosx + 23 —3 x 
f(x) = sinz+xy+2 : x=] y 
2—z2xty)+x2-7 Z 


(5) The package AUTO by Eusebius Doedel [77, 78] is designed for parameter continuation, that is, it uses homotopy methods to follow solutions as user-specified parameters are changed. Thus, it cannot assume that $\nabla h(t, x)$ has full rank, and must occasionally deal with bifurcations and other singular behavior. Read the documentation for AUTO and describe how it works.

(6) Consider the homotopy $h(t, x) = x^3 - x + (6t - 3)$ with $0 \le t \le 1$. At $t = 0$ we start with a negative solution, and end with a positive solution at $t = 1$. Explain why the “turning back” that occurs in this homotopy (twice) means that applying the Newton method just to fix errors in $x$, rather than working with $(t, x)$ together, will have difficulty.

(7) Consider solving the equations $f(x, y) = 0$ and $g(x, y) = 0$ with $f$ and $g$ smooth on $(x, y) \in [a, b] \times [c, d]$. Assume that $f(x, c) < 0$ and $f(x, d) > 0$ for all $x \in [a, b]$, and $g(a, y) < 0$ and $g(b, y) > 0$ for all $y \in [c, d]$. Show that there must be a solution $(x^*, y^*) \in [a, b] \times [c, d]$. [Hint: Let $\mathbf{x} = [x, y]^T$, $\mathbf{f}(\mathbf{x}) = [f(x, y),\ g(x, y)]^T$, and $\mathbf{k}(\mathbf{x}) = [x - \tfrac{1}{2}(a + b),\ y - \tfrac{1}{2}(c + d)]^T$. Define the homotopy $\mathbf{h}(\mathbf{x}, \lambda) = (1 - \lambda)\, \mathbf{k}(\mathbf{x}) + \lambda\, \mathbf{f}(\mathbf{x})$. Show that $\mathbf{h}(\mathbf{x}, \lambda) = 0$ has no solutions for $0 \le \lambda \le 1$ and $\mathbf{x}$ on the boundary of the rectangle $[a, b] \times [c, d]$. Use Sard’s theorem to show that $\{\, \mathbf{x} \mid \mathbf{h}(\mathbf{x}, \lambda) = \mathbf{s} \,\}$ consists of smooth curves for almost all $\mathbf{s}$. Take a sequence $\mathbf{s}_k \to 0$ as $k \to \infty$ where this solution set property is true for all $\mathbf{s}_k$. Use compactness of $[a, b] \times [c, d]$ to show the existence of a solution $\mathbf{x}^*$ of $\mathbf{f}(\mathbf{x}^*) = 0$.]

(8) Generalize Exercise 7 for rectangles $R = [a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_n, b_n] \subset \mathbb{R}^n$ and functions $f_i\colon \mathbb{R}^n \to \mathbb{R}$ for $i = 1, 2, \ldots, n$ with $f_i(x) < 0$ if $x_i = a_i$ and $f_i(x) > 0$ if $x_i = b_i$ for all $x \in R$.

(9) As $\lambda(s)$ approaches one, it is possible that we could have a turning point at $\lambda(s^*) = 1$. In this case, the curve $(x(s), \lambda(s)) \to (\hat{x}, 1)$ as $s \to s^*$ where $\nabla_x h(\hat{x}, 1)$ is not invertible. In this case, using Newton’s method on $x \mapsto h(x, 1)$ is unlikely to work well. An alternative approach is to use quadratic interpolation on $(x(s_0), \lambda(s_0))$, $(x(s_1), \lambda(s_1))$, $(x(s_2), \lambda(s_2))$ with $s_0 < s_1 < s_2 < s^*$ and use this interpolant to estimate $(\hat{x}, 1)$. Implement a scheme of this type, taking care to make sure that the method is robust to small perturbations. Test this on the homotopy $h(x, \lambda) = \lambda x^2 + (1 - \lambda)(2 + x - e^x)$.

(10) An application of Brouwer’s fixed-point theorem (Exercise 1) is the Perron-Frobenius theorem [20]: Any $n \times n$ matrix $A$ with $a_{ij} > 0$ for all $i, j$ has a positive eigenvalue $\lambda$ with a positive eigenvector $v$. Prove this using the map $x \mapsto Ax/\|Ax\|_1$ of the unit simplex $\Sigma = \{\, x \in \mathbb{R}^n \mid x_i \ge 0 \text{ for all } i, \text{ and } \sum_{i=1}^n x_i = 1 \,\}$ onto itself. [Hint: First show that the map takes $\Sigma \to \Sigma$. Then the Brouwer fixed-point theorem implies that this map has a fixed point. The fixed point is the desired eigenvector.] From this result show that any other eigenvalue $\mu$ of $A$ satisfies $|\mu| < \lambda$.


Project 


Consider the problem of determining the steady flow through a network of pipes. This network is considered to be a directed graph $G = (V, E)$ without cycles where each edge $e \in E$ of the graph has a start vertex $\operatorname{start}(e) \in V$ and an end vertex $\operatorname{end}(e) \in V$. We can represent this graph by a sparse matrix $B$ where $b_{e,x} = +1$ if $x = \operatorname{end}(e)$, $-1$ if $x = \operatorname{start}(e)$, and zero if $x$ is neither a start nor end vertex of $e$. The flows are generated by pressure differences: the flow $q_e = f_e(p_x - p_y)$ where $x = \operatorname{start}(e)$ and $y = \operatorname{end}(e)$ with the functions $f_e$ given. There is a set of source nodes $S$ and a set of destination nodes $D$. We assume that the functions $f_e$ are smooth and increasing with $f_e(0) = 0$. Conservation of mass means that for any node $x \notin S \cup D$, the net flow into $x$ is zero: $\sum_e b_{e,x}\, q_e = 0$. If $x \in S$ is a source node then there is a flow $\hat{q}_x$ being injected into this node, which must then move through the network of pipes: $\hat{q}_x + \sum_e b_{e,x}\, q_e = 0$; similarly, if $x \in D$ is a destination node, then there is a flow $-\hat{q}_x$ being removed from node $x$; the conservation equation at $x$ can also be written as $\hat{q}_x + \sum_e b_{e,x}\, q_e = 0$. Set the pressure at one node to be zero. Also, from conservation of the material flowing in the network, the total inflow must equal the total outflow: $\sum_{x \in S} \hat{q}_x + \sum_{x \in D} \hat{q}_x = 0$. So one inflow or outflow variable should be left “free”. This is implemented by removing the equation $\hat{q}_x + \sum_e b_{e,x}\, q_e = 0$ for that node from the system of equations to be solved. (This could be the same node where you set $p_x = 0$.)

Set up the nonlinear equations to solve for the steady state of a network of this type. The variables are the flows $q_e$ and the pressures $p_x$ inside the network. The inflow quantities $\hat{q}_x$ for source vertices $x \in S$ and outflow quantities $-\hat{q}_x$ for destination vertices $x \in D$ are given. Set up a guarded Newton method for solving this system of equations. Implement your method in a suitable programming language. Can you show that the Jacobian matrix for your linear system is invertible?

As a test problem, solve the problem shown in Figure 3.5.2. For this test use the table of flow functions $f_e(\Delta p) = \alpha_e\, \Delta p + \beta_e\, \Delta p / \sqrt{\gamma_e^2 + (\Delta p)^2}$ given by the parameters in Table 3.5.1.

Fig. 3.5.2 Network of pipes: test problem

Table 3.5.1 Network of pipes: test problem

edge (e)   α_e  β_e  γ_e      edge (e)   α_e  β_e  γ_e      edge (e)   α_e  β_e  γ_e
a → b      1    0    -        b → g      1    0    -        e → h      0    1    1
a → c      1    1    2        c → e      0    2    1        f → h      1    2    1
a → d      2    1    1        c → f      2    1    2        g → h      2    2    2
b → e      1    2    1        d → f      0    1    1        -          -    -    -

An alternative formulation is to write the pressure difference as a function of the flow through an edge: $p_x - p_y = g_e(q_e)$ where $y = \operatorname{end}(e)$ and $x = \operatorname{start}(e)$. Re-write your system of equations to use the functions $g_e$ instead of $f_e$.


Chapter 4
Approximations and Interpolation


The representation and approximation of functions is central to the practice of numer- 
ical analysis and scientific computation. The oldest of these is Taylor series developed 
by James Gregory and later Brook Taylor. However, the development of Taylor series 
requires knowledge of derivatives of arbitrarily high order. Interpolation instead only 
requires function values, and/or a few low-order derivatives at interpolation points. 

Usually these methods use polynomials as the interpolating functions, although 
trigonometric functions as the approximating functions are especially useful in meth- 
ods based on Fourier series, and piecewise polynomials such as splines have prop- 
erties that make them robust and practical tools. 

How well a given function can be approximated by simpler functions in one of 
these classes is an essential question for many applications. The quality of an approx- 
imation can be measured in different ways. The most commonly used measures of 
closeness of approximation are the ∞-norm (or max-norm) to describe worst-case
errors, and the 2-norm for least squares approximation. Multivariate approximation 
properties are especially important in application to partial differential equations. 
Modern approximation theory has been applied to questions arising in data science 
and machine learning, where multivariate functions (often functions of many vari- 
ables) need to be approximated using sparse data sets. 


4.1 Interpolation—Polynomials 


Interpolation is the task, given a collection of data points $(x_i, y_i)$, $i = 0, 1, 2, \ldots, n$, to find a function $p$ in a specified class $\mathcal{F}$ where $p(x_i) = y_i$ for $i = 0, 1, 2, \ldots, n$. Usually $\mathcal{F}$ is a vector space of functions, such as the set of all polynomials of degree $\le d$. Ideally, the dimension of $\mathcal{F}$ is equal to the number of data points, and the solution exists and is unique. Other classes of functions can be used, such as piecewise polynomials like cubic splines, or trigonometric functions $\cos(kx)$ and $\sin(kx)$ for integer values of $k$.


4.1.1 Polynomial Interpolation in One Variable 


For polynomial interpolation in one variable, given data points $(x_i, y_i)$, $i = 0, 1, 2, \ldots, n$, we want to find a polynomial $p$ of degree $\le d$ where

\[ p(x_i) = y_i \quad \text{for } i = 0, 1, 2, \ldots, n. \tag{4.1.1} \]

In order to have as many equations as unknowns, we note that the number of coefficients of $p(x)$ ($d + 1$) should be equal to the number of data points ($n + 1$). That is, we want $d + 1 = n + 1$ so $d = n$. This condition is necessary but not sufficient in order to guarantee existence and uniqueness of the solution.

The simplest non-trivial case is with $n = d = 1$, so we want to have a linear function $p(x)$ going through two data points. Provided the two data points are distinct, there is exactly one straight line going through these points. Provided the data points have different $x$-coordinates ($x_0 \ne x_1$), there is exactly one linear function going through these data points: the straight line through the data points is not vertical.
Linear interpolation has ancient roots—it was used by ancient Babylonians, and 
by ancient Greeks from about 300 BCE, for astronomical calculations. Chinese and 
Indian astronomers used quadratic interpolation in the seventh century, and higher 
order interpolation during the thirteenth and fourteenth centuries. Usually they used 
interpolation for tables of trigonometric functions to get accurate values “between 
the lines” of the values given. 

Interpolating values in tables is no longer so important, but there are
many ways in which interpolation is a central topic in numerical analysis. At heart, 
interpolation allows us to represent a function (approximately) in terms of a few data 
points. Operations that we would perform on the function, such as integration and 
differentiation, can be represented in terms of the values at the interpolation points. 


4.1.1.1 Existence and Uniqueness 


For higher order polynomial interpolation, the crucial condition we need is that none of the $x_i$’s are repeated, that is, we need the interpolation points $x_0, x_1, \ldots, x_n$ to be distinct: if $x_i = x_j$ but $i \ne j$, then we clearly need $y_i = p(x_i) = p(x_j) = y_j$ in order for an interpolant $p$ to exist. In particular, if $x_i = x_j$ but $y_i \ne y_j$, there is no interpolant of the data.

Even if an interpolant exists for repeated interpolation points, the number of equations to be satisfied is reduced by one, so the solution cannot be expected to be unique. But if the interpolation points $x_0, x_1, \ldots, x_n$ are distinct, then there is exactly one interpolant $p(x)$ of degree $\le n$.

Theorem 4.1 If $(x_i, y_i)$, $i = 0, 1, 2, \ldots, n$ and $x_i \ne x_j$ for $i \ne j$, then there is one and only one polynomial $p(x)$ of degree $\le n$ where

\[ p(x_i) = y_i \quad \text{for all } i = 0, 1, 2, \ldots, n. \]

Proof We represent a polynomial $p(x)$ of degree $\le n$ in terms of the vector of its coefficients $a \in \mathbb{R}^{n+1}$:

\[ p(x; a) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n. \]

Clearly, $p(x; a)$ is a linear function of $a$. Then the function

\[ F\colon a \mapsto \begin{bmatrix} p(x_0; a) \\ p(x_1; a) \\ \vdots \\ p(x_n; a) \end{bmatrix} \in \mathbb{R}^{n+1} \]

is a linear transformation $F\colon \mathbb{R}^{n+1} \to \mathbb{R}^{n+1}$. So, this linear function is onto if and only if it is one to one. In particular, we want to show that the only $a$ for which $F(a) = 0$ is $a = 0$.

So suppose that $a^*$ satisfies $F(a^*) = 0$. Then the polynomial $q(x) = p(x; a^*)$ is a polynomial of degree $\le n$ where $q(x_i) = 0$ for $i = 0, 1, 2, \ldots, n$. Then $q(x)$ can be factored

\[ q(x) = (x - x_0)(x - x_1) \cdots (x - x_n)\, g(x) \]

for some polynomial $g$, provided all $x_i$’s are distinct. This implies that the degree of $q$ is $(\deg g) + (n + 1)$. The only way this can be no more than $n$ is if $\deg g < 0$, which only occurs if $g$ is the zero polynomial. That is, $q(x)$ must be the zero polynomial, and hence all its coefficients must be zero. That is, $a^* = 0$ as we wanted. Thus, the linear transformation $F$ is one to one and so $F$ is onto, that is, there is exactly one solution $a$ to $F(a) = y$ for any $y$. This solution gives the coefficients of the unique interpolating polynomial.


4.1.1.2 Computing the Polynomial Interpolant 


The next task is computing the interpolant. Since the equations to be satisfied are linear equations in the coefficients, we can solve these equations directly:

\[ V_n a = \begin{bmatrix} 1 & x_0 & x_0^2 & \cdots & x_0^n \\ 1 & x_1 & x_1^2 & \cdots & x_1^n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^n \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix} = \begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_n \end{bmatrix}. \tag{4.1.2} \]

Table 4.1.1 Condition numbers $\kappa_2(V_n)$ of Vandermonde matrices

n          2             3             4             5             6             7
κ₂(V_n)    1.51 × 10^1   9.89 × 10^1   6.86 × 10^2   4.92 × 10^3   3.61 × 10^4   2.68 × 10^5
n          8             9             10            11            12            13
κ₂(V_n)    2.01 × 10^6   1.52 × 10^7   1.16 × 10^8   8.83 × 10^8   6.78 × 10^9   5.22 × 10^10

The matrix $V_n$ is called a Vandermonde matrix. By Theorem 4.1, this matrix must be invertible, provided all the $x_i$’s are distinct. A more direct proof is the calculation that its determinant is $\prod_{i > j} (x_i - x_j)$.
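The determinant formula is easy to check on small cases. The sketch below is an illustration under stated assumptions, not a recommended way to interpolate: it uses exact rational arithmetic from Python's `fractions` module, and the helper names `vandermonde`, `determinant`, and `product_formula` are hypothetical. It compares Gaussian elimination against the product over pairs $i > j$, and shows that a repeated point makes the matrix singular.

```python
from fractions import Fraction


def vandermonde(xs):
    """Rows [1, x_i, x_i^2, ..., x_i^n] for the interpolation points xs."""
    n = len(xs)
    return [[x**k for k in range(n)] for x in xs]


def determinant(M):
    """Determinant by Gaussian elimination in exact rational arithmetic."""
    M = [row[:] for row in M]
    n, det = len(M), Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if M[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)  # no nonzero pivot: the matrix is singular
        if pivot != c:
            M[c], M[pivot] = M[pivot], M[c]
            det = -det  # a row swap flips the sign
        det *= M[c][c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            M[r] = [a - factor * b for a, b in zip(M[r], M[c])]
    return det


def product_formula(xs):
    """Product of (x_i - x_j) over all pairs with i > j."""
    prod = Fraction(1)
    for i in range(len(xs)):
        for j in range(i):
            prod *= xs[i] - xs[j]
    return prod


xs = [Fraction(i, 4) for i in range(5)]  # equally spaced points on [0, 1]
print(determinant(vandermonde(xs)), product_formula(xs))
```

The determinant for equally spaced points on $[0, 1]$ is already very small for five points, which hints at the conditioning problem.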

While it is possible to compute the coefficients of the interpolating polynomial this way, it is not recommended. The reason is that the condition number of $V_n$ can easily become very large. For example, if we take $x_i = i/n$ for $i = 0, 1, 2, \ldots, n$, the 2-norm condition numbers are shown in Table 4.1.1.

The condition number of Vandermonde matrices grows exponentially in $n$ [72].

Because of the explosive growth of the condition numbers of Vandermonde matrices, other methods have been devised that avoid both the computational cost and the ill-conditioning of directly solving for the coefficients.


4.1.1.3 Lagrange Interpolation Polynomials 


While Joseph-Louise Lagrange in 1795 got the title credits for the polynomials used 
to create interpolants 


(4.1.3) Lix) = T] (=). 


ary 
pif OE 


the formulas were published earlier by Edward Waring (1779). The interpolant p(x) 
of the data p(x;) = y; fori = 0,1, 2,..., can be written as 


(4.1.4) pe)= > wie): 
i=0 


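The reason for the correctness of these formulas is that

$$
L_i(x_k) = \prod_{j=0,\ j \neq i}^{n} \frac{x_k - x_j}{x_i - x_j} = 0 \quad \text{if } k \neq i, \text{ and}
$$

$$
L_i(x_i) = \prod_{j=0,\ j \neq i}^{n} \frac{x_i - x_j}{x_i - x_j} = 1, \quad \text{so}
$$

$$
p(x_k) = \sum_{i=0}^{n} y_i\, L_i(x_k)
= \sum_{i=0}^{n} y_i \begin{cases} 1 & \text{if } k = i \\ 0 & \text{if } k \neq i \end{cases}
= y_k.
$$

Fig. 4.1.1 Lagrange interpolation polynomials for $x_0 = 0$, $x_1 = 1/2$, $x_2 = 1$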
Directly computing $L_i(x)$ from this formula takes $\approx 4n$ flops; thus computing $p(x)$
would take $\approx 4n^2$ flops via (4.1.3). This compares well with $\approx \frac{2}{3}n^3$ flops for solving
the Vandermonde equations. However, this is not the fastest method for computing
the value of an interpolating polynomial. In spite of this, Lagrange interpolation
polynomials provide a convenient means of representing the polynomial interpolant
of a given degree, and we will make use of this in later sections.
For example, if $x_0 = 0$, $x_1 = 1/2$, and $x_2 = 1$, then

$$
\begin{aligned}
L_0(x) &= \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)} = 2\left(x - \tfrac12\right)(x - 1), \\
L_1(x) &= \frac{(x - x_0)(x - x_2)}{(x_1 - x_0)(x_1 - x_2)} = 4x(1 - x), \\
L_2(x) &= \frac{(x - x_0)(x - x_1)}{(x_2 - x_0)(x_2 - x_1)} = 2x\left(x - \tfrac12\right).
\end{aligned}
$$

Note that $L_0(x) = L_2(1 - x)$ due to the symmetry of the placement of the
interpolation points. Plots of the three quadratic Lagrange interpolation polynomials are
shown in Figure 4.1.1.

The quadratic interpolant of $(x_0, y_0)$, $(x_1, y_1)$, and $(x_2, y_2)$ is then

$$
\begin{aligned}
p(x) &= y_0 L_0(x) + y_1 L_1(x) + y_2 L_2(x) \\
&= 2y_0\left(x - \tfrac12\right)(x - 1) + 4y_1\, x(1 - x) + 2y_2\, x\left(x - \tfrac12\right).
\end{aligned}
$$
Fig. 4.1.2 Comparison between quadratic interpolant $p(x)$ and $f(x) = e^x$: (A) plots of $p(x)$ and $e^x$; (B) plot of $e^x - p(x)$


If the $y_i$ values come from some function, $y_i = f(x_i)$, then we can use $p(x)$ as a
quadratic approximation to the function $f(x)$. For example, if $f(x) = e^x$ then the
interpolating quadratic is

$$
p(x) = 2\left(x - \tfrac12\right)(x - 1) + 4e^{1/2}\, x(1 - x) + 2e\, x\left(x - \tfrac12\right).
$$

A plot showing a comparison between $p(x)$ and $f(x)$, as well as a plot of the
difference, is shown in Figure 4.1.2.
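The worked example above can be reproduced numerically. This is a minimal sketch, assuming hypothetical helper names `lagrange_basis` and `lagrange_interpolant`; it evaluates (4.1.3) and (4.1.4) directly and compares against the closed form of the interpolating quadratic for $e^x$.

```python
import math

def lagrange_basis(nodes, i, x):
    """L_i(x) = prod_{j != i} (x - x_j) / (x_i - x_j), as in (4.1.3)."""
    value = 1.0
    for j, xj in enumerate(nodes):
        if j != i:
            value *= (x - xj) / (nodes[i] - xj)
    return value

def lagrange_interpolant(nodes, values, x):
    """p(x) = sum_i y_i L_i(x), as in (4.1.4)."""
    return sum(y * lagrange_basis(nodes, i, x) for i, y in enumerate(values))

nodes = [0.0, 0.5, 1.0]
values = [math.exp(t) for t in nodes]

def closed_form(x):
    """The interpolating quadratic for e^x written out explicitly."""
    return (2 * (x - 0.5) * (x - 1)
            + 4 * math.exp(0.5) * x * (1 - x)
            + 2 * math.e * x * (x - 0.5))

# Largest sampled difference between e^x and its quadratic interpolant on [0, 1].
max_err = max(abs(math.exp(k / 100) - lagrange_interpolant(nodes, values, k / 100))
              for k in range(101))
print(max_err)
```

The direct formula costs $O(n^2)$ flops per evaluation point, which is acceptable for small $n$ such as this.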


4.1.1.4 Divided Differences 


In a letter from 1675, Isaac Newton described a method of interpolating equally
spaced data [96]. That method is the basis of the modern divided difference method,
which allows non-equally spaced data. The starting point for this method is the
definition of divided differences. Firstly, for first-order divided differences,

$$
f[x_0, x_1] = \frac{f(x_0) - f(x_1)}{x_0 - x_1} \quad \text{for } x_0 \neq x_1,
$$

and the definition continues to higher orders by the formula

$$
f[x_0, x_1, \ldots, x_n] = \frac{f[x_0, x_1, \ldots, x_{n-1}] - f[x_1, x_2, \ldots, x_n]}{x_0 - x_n}
$$

provided all $x_i$'s are distinct. For example,

$$
\begin{aligned}
f[x_0, x_1, x_2] &= \frac{f[x_0, x_1] - f[x_1, x_2]}{x_0 - x_2} \\
&= \frac{(f(x_0) - f(x_1))/(x_0 - x_1) - (f(x_1) - f(x_2))/(x_1 - x_2)}{x_0 - x_2}.
\end{aligned}
$$
4.1.1.5 Independence of Ordering


The first-order divided differences do not depend on the order of the inputs:

$$
f[x_0, x_1] = \frac{f(x_0) - f(x_1)}{x_0 - x_1} = \frac{f(x_1) - f(x_0)}{x_1 - x_0} = f[x_1, x_0].
$$

This works more generally for higher order divided differences: the order of the $x_i$'s
in the divided difference does not matter.


Theorem 4.2 If $y_0, y_1, \ldots, y_n$ is any permutation of $x_0, x_1, \ldots, x_n$ and all $x_i$'s are
distinct, then $f[x_0, x_1, \ldots, x_n] = f[y_0, y_1, \ldots, y_n]$.


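Proof We show by induction on $n$ that

$$
f[x_0, x_1, \ldots, x_n] = \sum_{i=0}^{n} \frac{f(x_i)}{\prod_{j=0,\ j \neq i}^{n} (x_i - x_j)}.
$$

Once we have proven that this formula holds, it is obvious that the order does
not matter.
This formula holds for $n = 1$, since

$$
f[x_0, x_1] = \frac{f(x_0) - f(x_1)}{x_0 - x_1} = \frac{f(x_0)}{x_0 - x_1} + \frac{f(x_1)}{x_1 - x_0}.
$$

Now for induction, we suppose that the formula holds for $n = k - 1$, and we want to
show that it holds for $n = k$. Now consider

$$
\begin{aligned}
f[x_0, x_1, \ldots, x_k]
&= \frac{f[x_0, x_1, \ldots, x_{k-1}] - f[x_1, x_2, \ldots, x_k]}{x_0 - x_k} \\
&= \frac{1}{x_0 - x_k}\left[\sum_{i=0}^{k-1} \frac{f(x_i)}{\prod_{j=0,\ j \neq i}^{k-1} (x_i - x_j)}
 - \sum_{i=1}^{k} \frac{f(x_i)}{\prod_{j=1,\ j \neq i}^{k} (x_i - x_j)}\right] \\
&= \frac{1}{x_0 - x_k}\Bigg[\frac{f(x_0)}{\prod_{j=0,\ j \neq 0}^{k-1} (x_0 - x_j)} \\
&\qquad + \sum_{i=1}^{k-1} f(x_i)\left(\frac{1}{\prod_{j=0,\ j \neq i}^{k-1} (x_i - x_j)} - \frac{1}{\prod_{j=1,\ j \neq i}^{k} (x_i - x_j)}\right) \\
&\qquad - \frac{f(x_k)}{\prod_{j=1,\ j \neq k}^{k} (x_k - x_j)}\Bigg].
\end{aligned}
$$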
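After dividing by $(x_0 - x_k)$, the first and last terms are good since they are
$f(x_0)/\prod_{j=0,\ j \neq 0}^{k} (x_0 - x_j)$ and $f(x_k)/\prod_{j=0,\ j \neq k}^{k} (x_k - x_j)$.
For the middle term, we need to concentrate on the quantity inside the parentheses
$(\cdots)$. Looking for common factors we find (noting that $0 < i < k$)

$$
\begin{aligned}
\frac{1}{\prod_{j=0,\ j \neq i}^{k-1} (x_i - x_j)} - \frac{1}{\prod_{j=1,\ j \neq i}^{k} (x_i - x_j)}
&= \frac{1}{\prod_{j=1,\ j \neq i}^{k-1} (x_i - x_j)}\left[\frac{1}{x_i - x_0} - \frac{1}{x_i - x_k}\right] \\
&= \frac{1}{\prod_{j=1,\ j \neq i}^{k-1} (x_i - x_j)} \cdot \frac{(x_i - x_k) - (x_i - x_0)}{(x_i - x_0)(x_i - x_k)} \\
&= \frac{x_0 - x_k}{\prod_{j=0,\ j \neq i}^{k} (x_i - x_j)}.
\end{aligned}
$$

Dividing by $(x_0 - x_k)$ gives the correct term for $f(x_i)$. So

$$
f[x_0, x_1, \ldots, x_k] = \sum_{i=0}^{k} \frac{f(x_i)}{\prod_{j=0,\ j \neq i}^{k} (x_i - x_j)},
$$

as desired.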
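Both the symmetric-sum formula from the proof and the conclusion of Theorem 4.2 can be checked exactly on a small example. The sketch below uses rational arithmetic; the function names and the test function $1/(1+x^2)$ are illustrative assumptions.

```python
from fractions import Fraction
from itertools import permutations

def divided_difference(f, xs):
    """Recursive definition:
    f[x_0..x_n] = (f[x_0..x_{n-1}] - f[x_1..x_n]) / (x_0 - x_n)."""
    if len(xs) == 1:
        return f(xs[0])
    return ((divided_difference(f, xs[:-1]) - divided_difference(f, xs[1:]))
            / (xs[0] - xs[-1]))

def symmetric_sum(f, xs):
    """sum_i f(x_i) / prod_{j != i} (x_i - x_j), the formula proved above."""
    total = Fraction(0)
    for i, xi in enumerate(xs):
        denom = Fraction(1)
        for j, xj in enumerate(xs):
            if j != i:
                denom *= xi - xj
        total += f(xi) / denom
    return total

def f(x):
    # Hypothetical test function; with Fraction input the result stays exact.
    return 1 / (1 + x * x)

pts = [Fraction(0), Fraction(1, 3), Fraction(1), Fraction(5, 2)]
reference = divided_difference(f, pts)

print(reference == symmetric_sum(f, pts))
print(all(divided_difference(f, list(p)) == reference for p in permutations(pts)))
```

The recursive definition costs exponential time in the number of points, so it is only suitable for checks like this one, not for computation.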


4.1.1.6 Repeated Arguments in Divided Differences 


Formally, $f[x_0, x_1, \ldots, x_n]$ is only defined if $x_0, x_1, \ldots, x_n$ are all distinct. However,
we can give a meaning to $f[x_0, x_1, \ldots, x_n]$ even if some or all of the $x_i$'s are the
same. Here we will look at what happens when just two are the same.

First, $f[x_0, x_0] = (f(x_0) - f(x_0))/(x_0 - x_0) = 0/0$ is undefined, so we cannot
use that formula. However, the limit does exist if $f$ is differentiable:

$$
f[x_0, x_0] = \lim_{x_1 \to x_0} f[x_0, x_1]
= \lim_{x_1 \to x_0} \frac{f(x_1) - f(x_0)}{x_1 - x_0} = f'(x_0).
$$

Once we have this, we can work out

$$
f[x_0, x_0, x_1] = \frac{f[x_0, x_0] - f[x_0, x_1]}{x_0 - x_1} = \frac{f'(x_0) - f[x_0, x_1]}{x_0 - x_1},
$$

and higher order divided differences for $x_0 \neq x_1$.
4.1.1.7 Divided Differences and Polynomial Interpolation


With the result that the divided differences are independent of the order of the points
$x_0, x_1, \ldots, x_n$, we can show how they relate to polynomial interpolation. First we
note that

$$
f(x) = f(x_0) + \frac{f(x) - f(x_0)}{x - x_0}\,(x - x_0) = f(x_0) + f[x, x_0]\,(x - x_0).
$$

However, this trick can be repeated for $f[x, x_0]$:

$$
f[x, x_0] = f[x_1, x_0] + \frac{f[x, x_0] - f[x_1, x_0]}{x - x_1}\,(x - x_1)
= f[x_0, x_1] + f[x, x_0, x_1]\,(x - x_1).
$$

And again:

$$
f[x, x_0, x_1] = f[x_2, x_0, x_1] + \frac{f[x, x_0, x_1] - f[x_2, x_0, x_1]}{x - x_2}\,(x - x_2)
= f[x_0, x_1, x_2] + f[x, x_0, x_1, x_2]\,(x - x_2).
$$

In general,

$$
\begin{aligned}
f[x, x_0, \ldots, x_{k-1}] &= f[x_k, x_0, x_1, \ldots, x_{k-1}]
 + \frac{f[x, x_0, \ldots, x_{k-1}] - f[x_k, x_0, \ldots, x_{k-1}]}{x - x_k}\,(x - x_k) \\
&= f[x_0, x_1, \ldots, x_k] + f[x, x_0, x_1, \ldots, x_k]\,(x - x_k).
\end{aligned}
$$

Now we can use this to unwrap our function to expose its interpolating polynomial:

$$
\begin{aligned}
f(x) &= f(x_0) + (x - x_0)\, f[x, x_0] \\
&= f(x_0) + (x - x_0)\left(f[x_0, x_1] + f[x, x_0, x_1]\,(x - x_1)\right) \\
&= f(x_0) + (x - x_0)\, f[x_0, x_1] + (x - x_0)(x - x_1)\, f[x, x_0, x_1] \\
&= f(x_0) + (x - x_0)\, f[x_0, x_1] + (x - x_0)(x - x_1)\, f[x_0, x_1, x_2] \\
&\qquad + (x - x_0)(x - x_1)(x - x_2)\, f[x, x_0, x_1, x_2]
\end{aligned}
$$

etc.


In general, we get

$$
f(x) = \sum_{i=0}^{k} f[x_0, x_1, \ldots, x_i] \prod_{j=0}^{i-1} (x - x_j)
 + \prod_{j=0}^{k} (x - x_j)\; f[x, x_0, x_1, \ldots, x_k].
$$

The first part

$$
\sum_{i=0}^{k} f[x_0, x_1, \ldots, x_i] \prod_{j=0}^{i-1} (x - x_j)
$$

is a polynomial in $x$ of degree $\le k$. Furthermore, the extra part

$$
\prod_{j=0}^{k} (x - x_j)\; f[x, x_0, x_1, \ldots, x_k]
$$

is zero whenever $x = x_j$, $j = 0, 1, 2, \ldots, k$. So the interpolating polynomial is

$$
p_k(x) = \sum_{i=0}^{k} f[x_0, x_1, \ldots, x_i] \prod_{j=0}^{i-1} (x - x_j), \tag{4.1.5}
$$

and the error is

$$
f(x) - p_k(x) = \left[\prod_{j=0}^{k} (x - x_j)\right] f[x, x_0, x_1, \ldots, x_k]. \tag{4.1.6}
$$

Note that to compute $p_k(x)$ we only need $k + 1$ divided differences:
$D_i := f[x_0, x_1, \ldots, x_i]$, $i = 0, 1, 2, \ldots, k$.
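Since (4.1.5) and (4.1.6) are algebraic identities, they can be verified exactly in rational arithmetic. The following sketch does so for a hypothetical test function and node set; `divdiff`, `newton_form`, and `remainder` are illustrative names.

```python
from fractions import Fraction

def divdiff(f, xs):
    """Recursive divided difference f[xs[0], ..., xs[-1]] (distinct points)."""
    if len(xs) == 1:
        return f(xs[0])
    return (divdiff(f, xs[:-1]) - divdiff(f, xs[1:])) / (xs[0] - xs[-1])

def newton_form(f, nodes, x):
    """p_k(x) = sum_i f[x_0..x_i] * prod_{j<i} (x - x_j), equation (4.1.5)."""
    total = Fraction(0)
    prod = Fraction(1)
    for i in range(len(nodes)):
        total += divdiff(f, nodes[: i + 1]) * prod
        prod *= x - nodes[i]
    return total

def remainder(f, nodes, x):
    """prod_j (x - x_j) * f[x, x_0..x_k], the right-hand side of (4.1.6)."""
    prod = Fraction(1)
    for xj in nodes:
        prod *= x - xj
    return prod * divdiff(f, [x] + nodes)

def f(x):
    # Hypothetical test function; exact for Fraction input.
    return 1 / (1 + x * x)

nodes = [Fraction(0), Fraction(1, 2), Fraction(1), Fraction(2)]
x = Fraction(7, 10)   # must not coincide with a node

print(f(x) - newton_form(f, nodes, x) == remainder(f, nodes, x))
```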


4.1.1.8 Error Formula 


In many cases, we use polynomial interpolation to approximate a definite function $f$.
If $p_n$ is the polynomial interpolant of degree $\le n$ at interpolation points $x_0, x_1, \ldots, x_n$
then for every $x$ there is a $c_x$ where

$$
f(x) - p_n(x) = \frac{f^{(n+1)}(c_x)}{(n+1)!}\,(x - x_0)(x - x_1)\cdots(x - x_n). \tag{4.1.7}
$$

Here is a simple way of deriving it.


Theorem 4.3 (Interpolation error formula) Suppose $f$ is $(n+1)$ times differentiable.
Suppose that $p_n$ is the polynomial interpolant of $f$ at $(n+1)$ distinct
interpolation points $x_0, x_1, \ldots, x_n$. Then there is a $c_x$ between $\min(x, x_0, x_1, \ldots, x_n)$
and $\max(x, x_0, x_1, \ldots, x_n)$ such that (4.1.7) holds.


Proof If $x$ is one of the interpolation points, both sides of (4.1.7) are zero, so suppose
it is not. Consider the function

$$
h(t) = [f(x) - p_n(x)]\,\Psi(t) - [f(t) - p_n(t)]\,\Psi(x),
$$

with $\Psi(t) = \prod_{j=0}^{n} (t - x_j)$. This function is $(n+1)$ times differentiable. We note
that since $f(x_i) - p_n(x_i) = 0$ and $\Psi(x_i) = 0$ for $i = 0, 1, 2, \ldots, n$, we have

$$
\begin{aligned}
h(x_i) &= [f(x) - p_n(x)]\,\Psi(x_i) - [f(x_i) - p_n(x_i)]\,\Psi(x) = 0, \quad \text{and} \\
h(x) &= [f(x) - p_n(x)]\,\Psi(x) - [f(x) - p_n(x)]\,\Psi(x) = 0.
\end{aligned}
$$

Thus $h$ has $(n+2)$ zeros in the interval $[x_{\min}, x_{\max}]$ with $x_{\min} = \min(x, x_0, x_1, \ldots, x_n)$
and $x_{\max} = \max(x, x_0, x_1, \ldots, x_n)$. Between every adjacent pair of these zeros, by
Rolle's theorem, there is a zero of $h'$. Thus $h'$ has at least $(n+1)$ zeros in $[x_{\min}, x_{\max}]$.
Again, we can apply Rolle's theorem for each adjacent pair of zeros of $h'$ to show
that $h''$ has at least $n$ zeros in $[x_{\min}, x_{\max}]$. Repeating this argument, we see that $h^{(n+1)}$
has at least one zero in $[x_{\min}, x_{\max}]$. Pick one of them. Call it $c_x$.

Now

$$
h^{(n+1)}(t) = [f(x) - p_n(x)]\,\Psi^{(n+1)}(t) - [f^{(n+1)}(t) - p_n^{(n+1)}(t)]\,\Psi(x).
$$

But $p_n^{(n+1)}(t) = 0$ for any $t$, since $p_n$ is a polynomial of degree $\le n$, and so
its $(n+1)$'st derivative is identically zero. Now $\Psi(t) = \prod_{j=0}^{n} (t - x_j) = t^{n+1} - q(t)$
where $q$ is a polynomial of degree $\le n$. Thus
$\Psi^{(n+1)}(t) = (d/dt)^{n+1}(t^{n+1}) - (d/dt)^{n+1}(q(t)) = (n+1)! - 0 = (n+1)!$. Thus

$$
0 = h^{(n+1)}(c_x) = [f(x) - p_n(x)]\,(n+1)! - f^{(n+1)}(c_x)\,\Psi(x).
$$

Re-arranging gives

$$
f(x) - p_n(x) = \frac{f^{(n+1)}(c_x)}{(n+1)!}\,\Psi(x),
$$

as we wanted.


This error formula will be very valuable in later estimates of the error in methods 
based on polynomial interpolation. 


4.1.1.9 Divided Differences and Integrals 


There is an integral representation of divided differences. Let's start with

$$
\begin{aligned}
\int_0^1 f'(x_0 + t(x_1 - x_0))\, dt
&= \left.\frac{1}{x_1 - x_0}\, f(x_0 + t(x_1 - x_0))\right|_{t=0}^{t=1} \\
&= \frac{f(x_1) - f(x_0)}{x_1 - x_0} = f[x_0, x_1].
\end{aligned}
$$

In general, let

$$
\tau_n = \Big\{ (t_1, \ldots, t_n) \;\Big|\; t_i \ge 0 \text{ for all } i = 1, \ldots, n, \text{ and } \sum_{i=1}^{n} t_i \le 1 \Big\}.
$$

This is the $n$-dimensional standard simplex, which is a polyhedron with the vertices
$0$ and $e_i$, $i = 1, 2, \ldots, n$. Then we can show the following.


Theorem 4.4 If $f$ is $n$ times continuously differentiable, then

$$
f[x_0, x_1, \ldots, x_n] = \int_{\tau_n} f^{(n)}\Big(x_0 + \sum_{i=1}^{n} t_i (x_i - x_0)\Big)\, dt_1\, dt_2 \cdots dt_n. \tag{4.1.8}
$$

This is known as the Hermite-Genocchi formula.


Proof The proof is by induction on $n$. Since $\tau_1 = \{ t_1 \mid 0 \le t_1 \le 1 \}$, we have already
shown that (4.1.8) is true for $n = 1$.


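Now let's suppose that (4.1.8) is true for $n = k - 1$. We want to show that (4.1.8)
is true for $n = k$. To show that, consider

$$
\begin{aligned}
&\int_{\tau_k} f^{(k)}\Big(x_0 + \sum_{i=1}^{k} t_i (x_i - x_0)\Big)\, dt_1\, dt_2 \cdots dt_k \\
&\quad = \int_{\tau_{k-1}} dt_1 \cdots dt_{k-1} \int_0^{1 - \sum_{i=1}^{k-1} t_i}
 f^{(k)}\Big(x_0 + \sum_{i=1}^{k-1} t_i (x_i - x_0) + t_k (x_k - x_0)\Big)\, dt_k.
\end{aligned}
$$

The inner integral is

$$
\begin{aligned}
&\int_0^{1 - \sum_{i=1}^{k-1} t_i} f^{(k)}\Big(x_0 + \sum_{i=1}^{k-1} t_i (x_i - x_0) + t_k (x_k - x_0)\Big)\, dt_k \\
&\quad = \left.\frac{1}{x_k - x_0}\, f^{(k-1)}\Big(x_0 + \sum_{i=1}^{k-1} t_i (x_i - x_0) + t_k (x_k - x_0)\Big)\right|_{t_k = 0}^{t_k = 1 - \sum_{i=1}^{k-1} t_i} \\
&\quad = \frac{1}{x_k - x_0}\Big[ f^{(k-1)}\Big(x_0 + \sum_{i=1}^{k-1} t_i (x_i - x_0) + (x_k - x_0) - \sum_{i=1}^{k-1} t_i (x_k - x_0)\Big) \\
&\qquad\qquad\qquad - f^{(k-1)}\Big(x_0 + \sum_{i=1}^{k-1} t_i (x_i - x_0)\Big) \Big] \\
&\quad = \frac{1}{x_k - x_0}\Big[ f^{(k-1)}\Big(x_k + \sum_{i=1}^{k-1} t_i (x_i - x_k)\Big)
 - f^{(k-1)}\Big(x_0 + \sum_{i=1}^{k-1} t_i (x_i - x_0)\Big) \Big].
\end{aligned}
$$

So our original integral is

$$
\begin{aligned}
&\int_{\tau_k} f^{(k)}\Big(x_0 + \sum_{i=1}^{k} t_i (x_i - x_0)\Big)\, dt_1\, dt_2 \cdots dt_k \\
&\quad = \frac{1}{x_k - x_0} \int_{\tau_{k-1}} dt_1 \cdots dt_{k-1}
 \Big[ f^{(k-1)}\Big(x_k + \sum_{i=1}^{k-1} t_i (x_i - x_k)\Big)
 - f^{(k-1)}\Big(x_0 + \sum_{i=1}^{k-1} t_i (x_i - x_0)\Big) \Big] \\
&\quad = \frac{f[x_k, x_1, \ldots, x_{k-1}] - f[x_0, x_1, \ldots, x_{k-1}]}{x_k - x_0}
 = f[x_0, x_1, \ldots, x_{k-1}, x_k],
\end{aligned}
$$

as we wanted.
Then by the principle of induction, (4.1.8) holds for $n = 1, 2, 3, \ldots$.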


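We can use this integral representation to obtain an estimate for $f[x_0, x_1, \ldots, x_n]$ in
terms of $f^{(n)}$ provided $f$ is $n$ times continuously differentiable: by the mean value
theorem for multiple integrals,

$$
f[x_0, x_1, \ldots, x_n] = f^{(n)}(c)\, \operatorname{vol}(\tau_n) \tag{4.1.9}
$$

for some $c$ between $\min(x_0, x_1, \ldots, x_n)$ and $\max(x_0, x_1, \ldots, x_n)$.

Lemma 4.5 The $n$-dimensional volume of $\tau_n$ is $\operatorname{vol}(\tau_n) = 1/n!$.

Proof The $n$-dimensional volume is

$$
\operatorname{vol}(\tau_n) = \int_{\tau_n} dt_1\, dt_2 \cdots dt_n
= \int_0^1 dt_n \int_{\tau'_{n-1}} dt_1\, dt_2 \cdots dt_{n-1},
$$

where $\tau'_{n-1} = \{ (t_1, \ldots, t_{n-1}) \mid t_i \ge 0 \text{ for all } i,\ \sum_{i=1}^{n-1} t_i \le 1 - t_n \}$, which is just $\tau_{n-1}$
scaled by a factor of $1 - t_n$. So

$$
\operatorname{vol}(\tau_n) = \int_0^1 dt_n\, \operatorname{vol}((1 - t_n)\tau_{n-1})
= \int_0^1 dt_n\, (1 - t_n)^{n-1} \operatorname{vol}(\tau_{n-1})
= \frac{1}{n}\operatorname{vol}(\tau_{n-1}).
$$

To start the recursion, note that

$$
\operatorname{vol}(\tau_1) = \int_0^1 dt_1 = 1,
$$

so we get $\operatorname{vol}(\tau_n) = 1/n!$.


Thus we get

$$
f[x_0, x_1, \ldots, x_n] = \frac{1}{n!}\, f^{(n)}(c)
$$

for some $c$ between $\min_j x_j$ and $\max_j x_j$ provided $f$ is $n$ times continuously
differentiable. This formula can be applied to the error formula for polynomial interpolation:

$$
\begin{aligned}
f(x) - p_n(x) &= f[x, x_0, x_1, \ldots, x_n] \prod_{j=0}^{n} (x - x_j) \\
&= \frac{1}{(n+1)!}\, f^{(n+1)}(c) \prod_{j=0}^{n} (x - x_j)
\end{aligned}
\tag{4.1.10}
$$

for some $c$ between $\min(x, \min_j x_j)$ and $\max(x, \max_j x_j)$, provided $f^{(n+1)}$ is continuous.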
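As a quick sanity check of $f[x_0, \ldots, x_n] = f^{(n)}(c)/n!$, take $f(x) = e^x$, for which every derivative is $e^x$ and the point $c$ can be recovered explicitly. The sketch below is illustrative only; the nodes and the helper name `divdiff` are assumptions.

```python
import math

def divdiff(f, xs):
    """Recursive divided difference in floating point (distinct points)."""
    if len(xs) == 1:
        return f(xs[0])
    return (divdiff(f, xs[:-1]) - divdiff(f, xs[1:])) / (xs[0] - xs[-1])

xs = [0.0, 0.3, 0.7, 1.0]          # hypothetical nodes
n = len(xs) - 1
dd = divdiff(math.exp, xs)

# For f = exp, f^(n)(c) = e^c, so the formula predicts
# e^min(xs) / n!  <=  f[x_0..x_n]  <=  e^max(xs) / n!.
lower = math.exp(min(xs)) / math.factorial(n)
upper = math.exp(max(xs)) / math.factorial(n)

# Solving dd = e^c / n! for c gives a point inside [min(xs), max(xs)].
c = math.log(dd * math.factorial(n))
print(lower <= dd <= upper, min(xs) <= c <= max(xs))
```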


4.1.1.10 Computing Interpolants via Divided Differences 


While we have formulas for the divided differences $f[x_0, x_1, \ldots, x_k]$, implementing
these formulas directly does not lead to the most efficient way of computing these
divided differences, or of evaluating the interpolating polynomial $p(x)$ at a point $x$.

To compute the divided differences $f[x_0, x_1, \ldots, x_k]$ for $k = 0, 1, 2, \ldots, n$, we
consider a divided difference table:

$$
\begin{array}{lllll}
x_0 & f(x_0) & & & \\
x_1 & f(x_1) & f[x_0, x_1] & & \\
x_2 & f(x_2) & f[x_1, x_2] & f[x_0, x_1, x_2] & \\
\vdots & \vdots & \vdots & \vdots & \ddots \\
x_n & f(x_n) & f[x_{n-1}, x_n] & f[x_{n-2}, x_{n-1}, x_n] & \cdots \quad f[x_0, x_1, \ldots, x_n].
\end{array}
$$

To compute the second-order divided difference $f[x_0, x_1, x_2] = (f[x_0, x_1] - f[x_1, x_2])/(x_0 - x_2)$,
we need both $f[x_0, x_1]$ and $f[x_1, x_2]$. That is, each entry is computed from
the entry in the table on its left, and the entry left and up.

It is not necessary to keep all entries in this table, even if we compute each of
these entries. The entries we need to keep for evaluating the interpolating polynomial
are the diagonal entries $D_0 = f(x_0)$, $D_1 = f[x_0, x_1]$, $D_2 = f[x_0, x_1, x_2]$, \ldots, $D_n =
f[x_0, x_1, \ldots, x_n]$.

We begin the computations by first computing the first-order divided differences
$f[x_{i-1}, x_i]$. We do not wish to overwrite $f(x_0)$, but we can overwrite $f(x_1), f(x_2), \ldots,
f(x_n)$ with $f[x_0, x_1], f[x_1, x_2], \ldots, f[x_{n-1}, x_n]$. We can first overwrite $f(x_n)$ with
$f[x_{n-1}, x_n] = (f(x_{n-1}) - f(x_n))/(x_{n-1} - x_n)$, as there is no further use for $f(x_n)$ in
this divided difference table. Then $f(x_{n-1})$ can be overwritten by $f[x_{n-2}, x_{n-1}] =
(f(x_{n-2}) - f(x_{n-1}))/(x_{n-2} - x_{n-1})$. We can repeat this process going from the
bottom to the top of the first-order divided differences. Once the first-order divided
differences are computed, the second-order divided differences can be computed,
again from the bottom to the top. Continuing in this way we can compute all the
divided differences we need. Algorithm 52 shows pseudo-code for computing the
divided differences for $y_i = f(x_i)$.


Algorithm 52 Divided difference algorithm

    function divdif(x, y)
        d ← y
        for k = 1, 2, ..., n
            for i = n, n−1, ..., k
                d_i ← (d_i − d_{i−1}) / (x_i − x_{i−k})
            end for
        end for
        return d
    end function


Algorithm 53 Polynomial interpolant evaluation

    function pinterp(x, D, t)
        pval ← D_n
        for k = n−1, ..., 1, 0
            pval ← D_k + (t − x_k) · pval
        end for
        return pval
    end function


Once the divided differences $D_k = f[x_0, x_1, \ldots, x_k]$ are computed for $k =
0, 1, 2, \ldots, n$, to evaluate the interpolant at $x$, we evaluate

$$
p(x) = D_0 + D_1 (x - x_0) + D_2 (x - x_0)(x - x_1) + \cdots + D_n (x - x_0)(x - x_1) \cdots (x - x_{n-1}).
$$

As written, this requires $\sum_{k=0}^{n} (1 + 2k) = (n+1)^2$ flops. But this can be improved
by pulling out common factors for as many terms as possible:

$$
p(x) = D_0 + (x - x_0)\big[D_1 + (x - x_1)\big\{D_2 + (x - x_2)\big(D_3 + \cdots \big[D_{n-1} + (x - x_{n-1}) D_n\big] \cdots \big)\big\}\big].
$$

Pseudo-code implementing this "nested" approach is in Algorithm 53.


In terms of floating point operations, Algorithm 52 requires $\approx \frac{3}{2}n^2$ flops to
compute the divided differences, while the interpolant evaluation algorithm,
Algorithm 53, requires $3n$ flops for each evaluation of $p(x)$. Algorithm 53 is reminiscent
of Horner's nested multiplication method for polynomial evaluation using the
coefficients, which is shown in Algorithm 54.
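Algorithms 52 and 53 translate almost line for line into Python. The version below is a sketch using plain lists and 0-based indexing; the cubic test data is a hypothetical example chosen so that the divided differences are small integers.

```python
def divdif(x, y):
    """Algorithm 52: returns D with D[k] = f[x_0, ..., x_k].

    Column k of the divided difference table overwrites column k-1 in place,
    working from the bottom (i = n) up to i = k.
    """
    d = list(y)
    n = len(x) - 1
    for k in range(1, n + 1):
        for i in range(n, k - 1, -1):
            d[i] = (d[i] - d[i - 1]) / (x[i] - x[i - k])
    return d

def pinterp(x, D, t):
    """Algorithm 53: nested evaluation of the Newton form at the point t."""
    n = len(D) - 1
    pval = D[n]
    for k in range(n - 1, -1, -1):
        pval = D[k] + (t - x[k]) * pval
    return pval

# Hypothetical data sampled from q(x) = x^3 - 2x + 1.
x = [0.0, 1.0, 2.0, 4.0]
y = [xi ** 3 - 2.0 * xi + 1.0 for xi in x]
D = divdif(x, y)
print(D)                    # diagonal entries D_0, ..., D_3
print(pinterp(x, D, 3.0))   # reproduces q(3), since q has degree 3
```

Because the data comes from a cubic and there are four nodes, the interpolant is the cubic itself, and the last divided difference equals its leading coefficient.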
Algorithm 54 Horner's algorithm for $p(t) = \sum_{k=0}^{n} a_k t^k$

    function horner(a, t)
        pval ← a_n
        for k = n−1, ..., 1, 0
            pval ← a_k + t · pval
        end for
        return pval
    end function


4.1.1.11 Barycentric Formulas


Alternative formulations for representing the interpolating polynomial include the
barycentric formulas [21]. This starts with $\Psi(x) = \prod_{j=0}^{n} (x - x_j)$. Then we can
represent the Lagrange interpolation polynomials by

$$
L_i(x) = \frac{\Psi(x)}{\Psi'(x_i)\,(x - x_i)} \quad \text{for } x \neq x_i, \tag{4.1.11}
$$

$$
\Psi'(x_i) = \prod_{j:\, j \neq i} (x_i - x_j). \tag{4.1.12}
$$

Then the interpolating polynomial has the form

$$
p(x) = \Psi(x) \sum_{i=0}^{n} \frac{w_i\, y_i}{x - x_i}, \qquad w_i = \frac{1}{\Psi'(x_i)}. \tag{4.1.13}
$$

This is the first barycentric form of the interpolating polynomial.

A better form comes from the fact that $1 = \sum_{i=0}^{n} L_i(x)$. This is the Lagrange
representation of the constant polynomial $f(x) = 1$ for all $x$. The barycentric
representation of one is

$$
1 = \Psi(x) \sum_{i=0}^{n} \frac{w_i}{x - x_i}.
$$

Substituting this formula for $\Psi(x)$ into the first barycentric form gives the second
barycentric form:

$$
p(x) = \frac{\displaystyle\sum_{i=0}^{n} \frac{w_i\, y_i}{x - x_i}}{\displaystyle\sum_{i=0}^{n} \frac{w_i}{x - x_i}}. \tag{4.1.14}
$$

This can be implemented with just $O(n)$ flops with pre-computed weights $w_i$; however,
care must be taken to avoid overflow and division by zero issues. This
representation is used in the chebfun system [205].
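A minimal sketch of the second barycentric form follows. It is not the chebfun implementation; the function names are hypothetical, the weights are computed by the direct $O(n^2)$ product, and the only safeguard is an exact-match test that returns the data value when $x$ coincides with a node (overflow of the weights for large $n$ is not addressed).

```python
def barycentric_weights(nodes):
    """w_i = 1 / prod_{j != i} (x_i - x_j); an O(n^2) one-time setup."""
    weights = []
    for i, xi in enumerate(nodes):
        prod = 1.0
        for j, xj in enumerate(nodes):
            if j != i:
                prod *= xi - xj
        weights.append(1.0 / prod)
    return weights

def barycentric_eval(nodes, values, weights, x):
    """Second barycentric form (4.1.14); O(n) flops per evaluation point."""
    numer = 0.0
    denom = 0.0
    for xi, yi, wi in zip(nodes, values, weights):
        if x == xi:
            return yi            # avoid dividing by zero at a node
        term = wi / (x - xi)
        numer += term * yi
        denom += term
    return numer / denom

nodes = [0.0, 0.5, 1.0]
values = [t * t for t in nodes]   # hypothetical data from f(x) = x^2
w = barycentric_weights(nodes)
print(w)
print(barycentric_eval(nodes, values, w, 0.25))
```

Since the data comes from a quadratic and three nodes are used, the interpolant reproduces $x^2$ up to roundoff.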


4.1.1.12 Asymptotics of the Error 


Obtaining estimates of the error from the error formula for polynomial interpolation 
(4.1.7) is, in general, difficult to do precisely. Instead, we aim to obtain asymptotic 
estimates that are helpful in designing algorithms. 

We first consider the situation where we choose an interpolation pattern: $\xi_0 <
\xi_1 < \cdots < \xi_n$; the interpolation points are $x_i = a + h\,\xi_i$ for $i = 0, 1, 2, \ldots, n$. We
consider interpolation over an interval that is $h$-dependent: $[a + h\,\xi_{\min},\, a + h\,\xi_{\max}]$.
We assume that $\xi_{\min} \le \xi_0 < \cdots < \xi_n \le \xi_{\max}$. For example, if we are considering
equally spaced interpolation, we can take $\xi_i = i$ for $i = 0, 1, 2, \ldots, n$, $\xi_{\min} = 0$,
and $\xi_{\max} = n$. This scheme allows us to look at a number of other interpolation
approaches, while still dealing with the same overall asymptotic behavior.

As with most situations in numerical analysis, it is helpful to identify the things
that one has control over, and the things that are given, or beyond our control. Here
we control the $\xi_i$'s, $n$, and $h$. But the function $f$ is what it is. For this section, we
focus our interest on the behavior of the error as $h$ becomes small.

For the interpolation points $x_i = a + h\,\xi_i$, our formula for the interpolation error
is

$$
f(x) - p_n(x) = \frac{f^{(n+1)}(c_x)}{(n+1)!} \prod_{j=0}^{n} (x - x_j).
$$

We focus now on the maximum error over the interpolation interval
$I_h = [a + h\,\xi_{\min},\, a + h\,\xi_{\max}]$:

$$
\begin{aligned}
\max_{x \in I_h} |f(x) - p_n(x)|
&= \max_{x \in I_h} \left| \frac{f^{(n+1)}(c_x)}{(n+1)!} \prod_{j=0}^{n} (x - x_j) \right| \\
&\le \max_{x \in I_h} \left| \frac{f^{(n+1)}(c_x)}{(n+1)!} \right| \; \max_{x \in I_h} \left| \prod_{j=0}^{n} (x - x_j) \right|.
\end{aligned}
$$

Note that $c_x$ is between $a + h\,\xi_{\min}$ and $a + h\,\xi_{\max}$. Now the maximum of $|f^{(n+1)}(c_x)|$
over $x \in I_h$ is not something we have much control over. We
can see that provided $f^{(n+1)}$ is continuous, this maximum will approach $|f^{(n+1)}(a)|$
as $h \to 0$. Since we have a finite limit, for all $h > 0$ "sufficiently small" there
is a bound $|f^{(n+1)}(c)| \le M$ for all $c \in I_h$. This leads to
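$$
\max_{x \in [a + h\xi_{\min},\, a + h\xi_{\max}]} |f(x) - p_n(x)|
\le \frac{M}{(n+1)!} \max_{x \in [a + h\xi_{\min},\, a + h\xi_{\max}]} \left| \prod_{j=0}^{n} (x - x_j) \right|.
$$

Writing $x \in [a + h\,\xi_{\min},\, a + h\,\xi_{\max}]$ as $x = a + h\,\xi$ and $x_j = a + h\,\xi_j$ we get

$$
\begin{aligned}
\max_{x \in [a + h\xi_{\min},\, a + h\xi_{\max}]} |f(x) - p_n(x)|
&\le \frac{M}{(n+1)!} \max_{\xi \in [\xi_{\min}, \xi_{\max}]} \left| \prod_{j=0}^{n} \big((a + h\xi) - (a + h\xi_j)\big) \right| \\
&= \frac{M}{(n+1)!} \max_{\xi \in [\xi_{\min}, \xi_{\max}]} \left| \prod_{j=0}^{n} h\,(\xi - \xi_j) \right| \\
&= h^{n+1} \frac{M}{(n+1)!} \max_{\xi \in [\xi_{\min}, \xi_{\max}]} \left| \prod_{j=0}^{n} (\xi - \xi_j) \right|.
\end{aligned}
$$

This leads to our first asymptotic result:

$$
\begin{aligned}
\max_{x \in [a + h\xi_{\min},\, a + h\xi_{\max}]} |f(x) - p_n(x)|
&\le h^{n+1} \frac{M}{(n+1)!} \max_{\xi \in [\xi_{\min}, \xi_{\max}]} \left| \prod_{j=0}^{n} (\xi - \xi_j) \right| \\
&= h^{n+1} \frac{M}{(n+1)!}\, K_n(\boldsymbol{\xi}) = O(h^{n+1}) \quad \text{as } h \to 0,
\end{aligned}
\tag{4.1.15}
$$

where $K_n(\boldsymbol{\xi})$ denotes the maximum of $|\prod_{j=0}^{n} (\xi - \xi_j)|$ over $[\xi_{\min}, \xi_{\max}]$.

Note that while

    the maximum error over the interpolation interval is $O(h^{n+1})$ as $h \to 0$

is a simple and straightforward lesson, we should remember that there is a hidden
constant that depends on $n$ and the choice of interpolation scheme $\boldsymbol{\xi}$. Nevertheless,
it is a useful lesson.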


Example 4.6 As an example, we show the maximum polynomial interpolation error
for equally spaced interpolation ($\xi_i = i$) and different values of $n$
and $h$ applied to the function $f(x) = e^x/\sqrt{1 + x}$ near $a = 0$ in Figure 4.1.3. Note
the increasing slope as the order increases; the curves do not stay even approximately
straight, as the maximum error cannot go much lower than unit roundoff.
Estimates of the slopes, being the exponent $p$ in the approximate relationship
$\text{max error} \approx \text{constant} \cdot h^p$, derived empirically from the data of Figure 4.1.3, are
shown in Table 4.1.2. While most of the slopes are close to $n + 1$, for higher degree
interpolants, this precise relation is somewhat obscured by the floor of roundoff error.
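The slope estimate can be reproduced for low degrees with a few lines of code. The sketch below is an assumption-laden illustration, not the computation behind Table 4.1.2: it uses only two spacings ($h$ and $h/2$ with $h = 10^{-3}$), estimates the maximum error by sampling, and all function names are hypothetical.

```python
import math

def f(x):
    return math.exp(x) / math.sqrt(1.0 + x)

def divdif(x, y):
    """Divided differences D[k] = f[x_0..x_k] (in-place table update)."""
    d = list(y)
    n = len(x) - 1
    for k in range(1, n + 1):
        for i in range(n, k - 1, -1):
            d[i] = (d[i] - d[i - 1]) / (x[i] - x[i - k])
    return d

def pinterp(x, D, t):
    """Nested evaluation of the Newton form at t."""
    pval = D[-1]
    for k in range(len(D) - 2, -1, -1):
        pval = D[k] + (t - x[k]) * pval
    return pval

def max_error(n, h, samples=200):
    """Sampled maximum of |f - p_n| over [0, n*h] with nodes x_i = i*h."""
    nodes = [h * i for i in range(n + 1)]
    D = divdif(nodes, [f(t) for t in nodes])
    width = n * h
    return max(abs(f(width * k / samples) - pinterp(nodes, D, width * k / samples))
               for k in range(samples + 1))

def slope(n, h):
    """Estimate p in max error ~ constant * h^p from spacings h and h/2."""
    return math.log(max_error(n, h) / max_error(n, h / 2)) / math.log(2.0)

for n in (1, 2):
    print(n, slope(n, 1e-3))   # expected to be close to n + 1
```

For larger $n$ or smaller $h$ the sampled errors approach roundoff level and the estimated slope degrades, as the example describes.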
Fig. 4.1.3 Maximum error over interpolation interval for equally spaced interpolation points;
$f(x) = e^x/\sqrt{1 + x}$ near $x = 0$ (vertical axis: max error)


Table 4.1.2 Slope estimates for log-log plot of maximum error against spacing $h$

Slope estimate             | 1.999 | 3.098 | 3.870 | 4.682 | 5.573
Asymptotic slope ($n + 1$) | 2     | 3     | 4     | 5     | 6


4.1.1.13 Runge Phenomenon


Figure 4.1.3 shows how taking $h \to 0$ affects the maximum error, for fixed $n$. But if we
vary both $h$ and $n$ together, this simple asymptotic relationship does not necessarily
hold. What happens then depends very much on the function being interpolated,
the interval over which it is being interpolated, and also the interpolation scheme
$(\xi_0, \xi_1, \ldots, \xi_n)$ being used. In fact, the error can easily grow rapidly as $n$ is increased,
as Carl Runge discovered in 1901 [223].

The Runge phenomenon is the failure of convergence of polynomial interpolants
for certain functions over a fixed interval as the degree of the interpolant $n$ goes
to infinity. Runge's specific example was equally spaced interpolation of $f(x) =
1/(1 + x^2)$ over the interval $[-5, +5]$. Plots for $n = 10$ and $n = 20$ are shown in
Figure 4.1.4. Figure 4.1.5 shows how the maximum error of the interpolant behaves
as the degree $n$ increases for this case and for interpolating the same function over
$[-2, +2]$, and over $[-1, +1]$.

The moral of this story: reduce $h$ before increasing $n$.
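Runge's example is easy to reproduce. The following sketch interpolates $1/(1+x^2)$ at equally spaced points using the Newton form and estimates the maximum error by sampling; the function names, the sample count, and the use of the Newton form (rather than whatever was used for the figures) are assumptions.

```python
def runge(x):
    return 1.0 / (1.0 + x * x)

def divdif(x, y):
    """Divided differences D[k] = f[x_0..x_k] (in-place table update)."""
    d = list(y)
    n = len(x) - 1
    for k in range(1, n + 1):
        for i in range(n, k - 1, -1):
            d[i] = (d[i] - d[i - 1]) / (x[i] - x[i - k])
    return d

def pinterp(x, D, t):
    """Nested evaluation of the Newton form at t."""
    pval = D[-1]
    for k in range(len(D) - 2, -1, -1):
        pval = D[k] + (t - x[k]) * pval
    return pval

def max_interp_error(n, half_width=5.0, samples=1000):
    """Sampled max |f - p_n| on [-half_width, half_width], n+1 equally spaced nodes."""
    nodes = [-half_width + 2.0 * half_width * i / n for i in range(n + 1)]
    D = divdif(nodes, [runge(t) for t in nodes])
    grid = (-half_width + 2.0 * half_width * k / samples for k in range(samples + 1))
    return max(abs(runge(s) - pinterp(nodes, D, s)) for s in grid)

for n in (10, 20):
    print(n, max_interp_error(n, 5.0), max_interp_error(n, 1.0))
```

On $[-5, +5]$ the sampled error grows when $n$ goes from 10 to 20, while on the shorter interval $[-1, +1]$ it shrinks, consistent with the moral above.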
Fig. 4.1.4 Plots of equally spaced interpolants for $f(x) = 1/(1 + x^2)$ on $[-5, +5]$:
$1/(1 + x^2)$ and its interpolant of degree 10, and of degree 20

Fig. 4.1.5 Runge phenomenon: interpolating $1/(1 + x^2)$ with equally spaced interpolation points
over $[-5, +5]$, $[-2, +2]$ and $[-1, +1]$ (vertical axis: max error)


4.1.1.14 Hermite Interpolation


A variant of traditional interpolation is to use derivative information to specify the
interpolant. The simplest of these variants is Hermite interpolation: given $(x_i, y_i, y'_i)$
for $i = 0, 1, 2, \ldots, n$, we want to find a polynomial $p(x)$ of minimal degree where

$$
p(x_i) = y_i, \qquad p'(x_i) = y'_i, \qquad \text{for } i = 0, 1, 2, \ldots, n. \tag{4.1.16}
$$

Note that this is a system of $2(n + 1)$ equations, so we seek the degree so that $p$ has
$2(n + 1)$ coefficients; that is, we seek $p$ with degree no more than $2n + 1$. If $a \in \mathbb{R}^{2n+2}$
is the vector of coefficients for $p$, then the linear map $F \colon a \mapsto [p(x_i; a),\, p'(x_i; a) \mid
i = 0, 1, \ldots, n]$ from $\mathbb{R}^{2n+2}$ to $\mathbb{R}^{2n+2}$ is one to one: if $F(a^*) = 0$ then $p(x; a^*)$ is a
polynomial of degree $\le 2n + 1$ that has zeros of multiplicity two at each point $x_i$. Thus
$p(x; a^*) = \prod_{i=0}^{n} (x - x_i)^2 \cdot g(x)$. But then the degree of $p$ is $2(n + 1) + \deg g$. This
implies that $\deg g < 0$, which can only occur if $g(x)$ is identically zero. That means
that $a^* = 0$. Since $F \colon \mathbb{R}^{2n+2} \to \mathbb{R}^{2n+2}$ is linear and one to one, $F$ is also onto.
That means there is exactly one Hermite interpolant for the given data as long as the
$x_i$'s are distinct.

Computing the Hermite interpolant of this data can be done using divided difference
tables using the identity $f[x_i, x_i] = f'(x_i)$:

$$
\begin{array}{llll}
x_0 & f(x_0) & & \\
x_0 & f(x_0) & f[x_0, x_0] = f'(x_0) & \\
x_1 & f(x_1) & f[x_0, x_1] & f[x_0, x_0, x_1] \\
x_1 & f(x_1) & f[x_1, x_1] = f'(x_1) & f[x_0, x_1, x_1] \\
x_2 & f(x_2) & f[x_1, x_2] & f[x_1, x_1, x_2] \\
\vdots & \vdots & \vdots & \vdots \\
x_n & f(x_n) & f[x_n, x_n] = f'(x_n) & f[x_{n-1}, x_n, x_n]
\end{array}
$$

Initially, the column of function values is filled in with $f(x_i)$ appearing twice, and
then the second column has every second entry filled with the derivative $f'(x_i) =
f[x_i, x_i]$. The remaining entries in the column for first-order divided differences are
filled in using the standard formula. Once that is done, the remainder of the table can
be filled in using the standard formulas. The evaluation of the Hermite interpolant
can be done by a modification of the algorithm for standard interpolation:

$$
\begin{aligned}
p(x) &= f(x_0) + f[x_0, x_0]\,(x - x_0) + f[x_0, x_0, x_1]\,(x - x_0)^2 + \cdots \\
&\quad + f[x_0, x_0, \ldots, x_n, x_n]\,(x - x_0)^2 \cdots (x - x_{n-1})^2 (x - x_n).
\end{aligned}
$$

Pseudo-code for these is shown in Algorithms 55 and 56. In both, $\hat{x}$ denotes the list of
$2n + 2$ nodes in which each $x_i$ appears twice.


Algorithm 55 Hermite divided difference algorithm

    function hdivdif(x, y, y′)
        d_0 ← y_0;  d_1 ← y′_0
        x̂_0 ← x_0;  x̂_1 ← x_0
        for i = 1, 2, ..., n
            d_{2i} ← (y_i − y_{i−1}) / (x_i − x_{i−1})
            d_{2i+1} ← y′_i
            x̂_{2i} ← x_i;  x̂_{2i+1} ← x_i
        end for
        for k = 2, ..., 2n+1
            for i = 2n+1, 2n, ..., k
                d_i ← (d_i − d_{i−1}) / (x̂_i − x̂_{i−k})
            end for
        end for
        return d
    end function


Algorithm 56 Evaluation of Hermite interpolant

    function hinterp(x, D, t)
        for i = 0, 1, 2, ..., n
            x̂_{2i} ← x_i;  x̂_{2i+1} ← x_i
        end for
        return pinterp(x̂, D, t)
    end function
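A Python rendering of the Hermite divided difference computation is sketched below. It assumes 0-based lists of $n+1$ nodes, values, and derivative values, stores $2n+2$ entries (indices $0$ to $2n+1$), and returns the doubled node list together with the divided differences; these interface choices are assumptions for illustration.

```python
def hdivdif(x, y, dy):
    """Hermite divided differences.

    Returns (z, d) where z holds each node twice and
    d[k] = f[z_0, ..., z_k] for k = 0, ..., 2n+1.
    """
    n = len(x) - 1
    m = 2 * n + 2
    z = [0.0] * m
    d = [0.0] * m
    z[0] = z[1] = x[0]
    d[0] = y[0]
    d[1] = dy[0]                      # f[x_0, x_0] = f'(x_0)
    for i in range(1, n + 1):
        z[2 * i] = z[2 * i + 1] = x[i]
        d[2 * i] = (y[i] - y[i - 1]) / (x[i] - x[i - 1])
        d[2 * i + 1] = dy[i]          # f[x_i, x_i] = f'(x_i)
    # Higher-order columns; z[i] != z[i-k] for k >= 2 when the x_i are distinct.
    for k in range(2, m):
        for i in range(m - 1, k - 1, -1):
            d[i] = (d[i] - d[i - 1]) / (z[i] - z[i - k])
    return z, d

def pinterp(x, D, t):
    """Nested evaluation of the Newton form at t."""
    pval = D[-1]
    for k in range(len(D) - 2, -1, -1):
        pval = D[k] + (t - x[k]) * pval
    return pval

# Hypothetical data from f(x) = x^3 at nodes 0 and 1: values and derivatives.
z, d = hdivdif([0.0, 1.0], [0.0, 1.0], [0.0, 3.0])
print(z, d)
print(pinterp(z, d, 2.0))   # the Hermite interpolant of a cubic on two nodes is the cubic
```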
There is also a Lagrange-type representation:

$$
p(x) = \sum_{i=0}^{n} \big( y_i\, H_i(x) + y'_i\, K_i(x) \big). \tag{4.1.17}
$$

The polynomials $H_i$ and $K_i$ can be written in terms of the standard Lagrange
interpolation polynomials:

$$
\begin{aligned}
H_i(x) &= L_i(x)^2 \left[ 1 - 2 L'_i(x_i)\,(x - x_i) \right], \\
K_i(x) &= L_i(x)^2\, (x - x_i).
\end{aligned}
$$

Note that $L'_i(x_i) = \sum_{j:\, j \neq i} 1/(x_i - x_j)$.
The interpolation error formula has a similar form to the standard one:

$$
f(x) - p(x) = \frac{f^{(2n+2)}(c_x)}{(2n+2)!} \prod_{j=0}^{n} (x - x_j)^2 \tag{4.1.18}
$$

for some $c_x$ between $\min(x, x_0, \ldots, x_n)$ and $\max(x, x_0, \ldots, x_n)$. While Hermite
interpolation promises higher order accuracy, with an error of $O(h^{2n+2})$ for fixed
$n$ as $h \to 0$, it does not perform significantly better than standard equally spaced
interpolation on Runge's example.

This idea can be generalized to picking an arbitrary number of derivatives to
specify at each interpolation point. The extreme case is where one point is repeated
as many times as appropriate in the divided difference formula:

$$
\begin{aligned}
f(x) &= f(x_0) + f[x_0, x_0]\,(x - x_0) + f[x_0, x_0, x_0]\,(x - x_0)^2 + \cdots \\
&\quad + f[\underbrace{x_0, \ldots, x_0}_{n+1 \text{ times}}]\,(x - x_0)^n
 + f[x, \underbrace{x_0, \ldots, x_0}_{n+1 \text{ times}}]\,(x - x_0)^{n+1}.
\end{aligned}
$$

Application of the Hermite-Genocchi formula (4.1.8) reveals that this formula is
simply Taylor series with remainder in integral form.


4.1.2 Lebesgue Numbers and Reliability 


We would like the result of interpolation to be close to the best possible
approximation from our family of interpolating functions. But the Runge phenomenon
(Section 4.1.1.13) shows that sometimes this is far from being true. In fact,
interpolation can be extremely bad. So how can we trust an interpolation scheme?
Some functions are difficult to approximate by polynomials. Rough functions
are harder to approximate by polynomials. Functions with isolated features, like
$f(x) = 1/(1 + x^2)$ over the interval $[-5, +5]$, are also hard to approximate by
polynomials. But the Runge phenomenon shows that the maximum interpolation
error can increase as the degree $n$ increases, while the best approximation error by
polynomials of degree $\le n$ can only decrease as $n$ increases. Why does this happen?

We want to separate the question of how hard a function is to approximate by
polynomials from the question of the reliability of the interpolation method.

Suppose that we have an interpolation scheme: given function values $f(x_i)$, $i =
0, 1, 2, \ldots, n$, we have an interpolant

$$
Pf(x) = \sum_{i=0}^{n} f(x_i)\, \ell_i(x). \tag{4.1.19}
$$

In the case of polynomial interpolation, the $\ell_i$ functions are the Lagrange
interpolation polynomials $L_i$ in (4.1.3). Other kinds of interpolation can be incorporated
into this approach: trigonometric polynomial interpolation, spline interpolation, and
multivariate interpolation of different kinds.

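We can now consider $P$ as a linear function from continuous functions on $[a, b]$,
denoted $C[a, b]$, using

$$
\|g\|_\infty = \max_{a \le x \le b} |g(x)|
$$

as the norm for continuous functions. Let $\mathcal{P}$ be the set of interpolating functions:
$\mathcal{P}$ is the image of $P$. An essential property for the interpolation operator $P$ is that
for any $q \in \mathcal{P}$, $Pq = q$; that is, the interpolant of an interpolant is the original
interpolant. More succinctly, $P^2 f = Pf$ for any $f \in C[a, b]$. So $P^2 = P$. That is,
$P$ is a projection.

What is important here is the operator norm

$$
\|P\|_\infty = \sup_{f \neq 0} \frac{\|Pf\|_\infty}{\|f\|_\infty} = \sup_{f:\, \|f\|_\infty = 1} \|Pf\|_\infty. \tag{4.1.20}
$$

We can compute this operator norm:

$$
\begin{aligned}
\|Pf\|_\infty = \max_{a \le x \le b} |Pf(x)|
&= \max_{a \le x \le b} \left| \sum_{i=0}^{n} f(x_i)\, \ell_i(x) \right| \\
&\le \max_{a \le x \le b} \sum_{i=0}^{n} |f(x_i)|\, |\ell_i(x)|
 \le \max_{a \le x \le b} \sum_{i=0}^{n} \|f\|_\infty\, |\ell_i(x)| \\
&= \|f\|_\infty \max_{a \le x \le b} \sum_{i=0}^{n} |\ell_i(x)|, \quad \text{so} \\
\|P\|_\infty &\le \max_{a \le x \le b} \sum_{i=0}^{n} |\ell_i(x)|.
\end{aligned}
$$

In fact, this upper bound is the value of $\|P\|_\infty$: Suppose that $\sum_{i=0}^{n} |\ell_i(x^*)| =
\max_{a \le x \le b} \sum_{i=0}^{n} |\ell_i(x)|$. Then set $f$ to be a function satisfying $f(x_i) = \operatorname{sign} \ell_i(x^*)$,
and $|f(x)| \le 1$ for all $x$. We can do this by using piecewise linear interpolation, for
example. Then for this $f$, $\|f\|_\infty = 1$ and

$$
\begin{aligned}
\|Pf\|_\infty = \max_{a \le x \le b} \left| \sum_{i=0}^{n} f(x_i)\, \ell_i(x) \right|
&\ge \left| \sum_{i=0}^{n} f(x_i)\, \ell_i(x^*) \right|
 = \left| \sum_{i=0}^{n} \operatorname{sign} \ell_i(x^*) \cdot \ell_i(x^*) \right| \\
&= \sum_{i=0}^{n} |\ell_i(x^*)| = \max_{a \le x \le b} \sum_{i=0}^{n} |\ell_i(x)| \; \|f\|_\infty.
\end{aligned}
$$

Since the inequalities between $\|P\|_\infty$ and $\max_{a \le x \le b} \sum_{i=0}^{n} |\ell_i(x)|$ go in both
directions, they must be equal. We call $\|P\|_\infty = \max_{a \le x \le b} \sum_{i=0}^{n} |\ell_i(x)|$ the Lebesgue
number for the interpolation method.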


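Theorem 4.7 (Lebesgue numbers) For any $f \in C[a, b]$ and interpolation operator
$P$ on $C[a, b]$, the interpolation error satisfies

$$
\|f - Pf\|_\infty \le (1 + \|P\|_\infty) \inf_{q \in \mathcal{P}} \|f - q\|_\infty.
$$

Proof For any $q \in \mathcal{P}$,

$$
\begin{aligned}
\|f - Pf\|_\infty &= \|f - q + q - Pf\|_\infty, \quad \text{but } Pq = q \text{ since } q \in \mathcal{P}, \\
&= \|f - q + Pq - Pf\|_\infty \\
&\le \|f - q\|_\infty + \|P(q - f)\|_\infty \\
&\le \|f - q\|_\infty + \|P\|_\infty \|f - q\|_\infty \\
&= (1 + \|P\|_\infty)\, \|f - q\|_\infty.
\end{aligned}
$$

Taking the infimum over $q \in \mathcal{P}$ gives the result, as we wanted.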


The important issue then is to estimate

$$
\|P\|_\infty = \max_{a \le x \le b} \sum_{i=0}^{n} |\ell_i(x)|
$$

for any interpolation scheme we wish to investigate. For any interpolation
scheme that is exact for constant functions, $1 = \sum_{i=0}^{n} \ell_i(x)$, so $1 \le \sum_{i=0}^{n} |\ell_i(x)|$ and
thus $\max_{a \le x \le b} \sum_{i=0}^{n} |\ell_i(x)| \ge 1$. This is not surprising as $P$ is a projection: $P = P^2$.
For any operator norm $\|P\| = \|P^2\| \le \|P\|^2$, so either $\|P\| = 0$ (and $P = 0$) or
$\|P\| \ge 1$.

We first consider equally spaced polynomial interpolation. We focus on the case of
interpolation on $[0, 1]$. Later we will see that the Lebesgue number does not depend
on the interval $[a, b]$. For now, we note that for standard polynomial interpolation,
$\ell_i(x) = L_i(x)$, the Lagrange interpolation polynomials (4.1.3).

Figure 4.1.6 shows $\sum_{i=0}^{n} |L_i(x)|$ for equally spaced interpolation ($x_i = i/n$) for
$n = 20$. Clearly the maximizing $x$ is between $x = 0$ and $x = 1/n$, and symmetrically,
between $x = 1 - 1/n$ and $x = 1$. Most of this is due to $L_i(x)$ for $i \approx n/2$, not $i \approx 0$
or $i \approx n$. Figure 4.1.7 shows $|L_i(x)|$ for equally spaced interpolation with $n = 20$
and $i = 10$.

The Lebesgue number for equally spaced interpolation depends on the degree of
the interpolation. How this changes with $n$ is shown in Figure 4.1.8.
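The Lebesgue number can be estimated by sampling the Lebesgue function $\sum_i |L_i(x)|$ on a fine grid. The sketch below does this for equally spaced points on $[0, 1]$; the grid maximum is a lower estimate of the true maximum, and the function names and sample count are assumptions.

```python
def lebesgue_function(nodes, x):
    """sum_i |L_i(x)| with L_i the Lagrange polynomials for the given nodes."""
    total = 0.0
    for i, xi in enumerate(nodes):
        li = 1.0
        for j, xj in enumerate(nodes):
            if j != i:
                li *= (x - xj) / (xi - xj)
        total += abs(li)
    return total

def lebesgue_number(nodes, a, b, samples=2000):
    """Sampled estimate of max_{a <= x <= b} sum_i |L_i(x)|."""
    return max(lebesgue_function(nodes, a + (b - a) * k / samples)
               for k in range(samples + 1))

def equispaced(n):
    """Nodes x_i = i/n, i = 0, ..., n, on [0, 1]."""
    return [i / n for i in range(n + 1)]

for n in (2, 5, 10, 20):
    print(n, lebesgue_number(equispaced(n), 0.0, 1.0))
```

For $n = 2$ the Lebesgue function on $[0, 1/2]$ is $1 - 4x(x - \tfrac12)$, with maximum $1.25$ at $x = 1/4$, which gives a simple check on the code.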

Refined calculations give the asymptotic value of Lebesgue constants for equally
spaced interpolation (denoted $\Lambda_n$) [231]:

$$
\Lambda_n \sim \frac{2^{n+1}}{e\, n \ln n} \quad \text{as } n \to \infty.
$$

This exponential growth results in large interpolation errors. Even if the best
approximation error goes to zero at an exponential rate as the polynomial degree $n$ goes
to infinity, the exponential growth of the Lebesgue constants will result in exploding
interpolation errors unless the exponential decay of the best approximation is
even faster. The example of the Runge phenomenon (Section 4.1.1.13) shows that
interpolation errors can grow exponentially even for functions that are infinitely
differentiable.

Fig. 4.1.6 $\sum_{i=0}^{n} |L_i(x)|$ for equally spaced interpolation points, $n = 20$

Fig. 4.1.7 $|L_{10}(x)|$ for equally spaced interpolation, $n = 20$

Using a different interpolation scheme, with a different distribution of interpolation
points, we can get drastically different Lebesgue constants. For Chebyshev
points (Section 4.6.2),

$$ x_k = \frac{a+b}{2} + \frac{b-a}{2} \cos\!\left(\frac{(k+\frac12)\pi}{n+1}\right), \qquad k = 0, 1, 2, \ldots, n, $$

we have the Lebesgue constants [214]:

$$ \Lambda_n \sim \frac{2}{\pi} \ln n \quad \text{as } n \to \infty. $$

Fig. 4.1.8 Lebesgue numbers for equally spaced polynomial interpolation

Fig. 4.1.9 Lebesgue numbers for Chebyshev polynomial interpolation


Figure 4.1.9 shows the Lebesgue constants for polynomial interpolation with Cheby- 
shev points. 

The slow growth of the Lebesgue numbers for polynomial interpolation using
Chebyshev points indicates the quality of these points for interpolation. Because of
this slow growth, the Runge phenomenon does not occur with Chebyshev points.
Compare the maximum errors using Chebyshev points as shown in Figure 4.1.10
with those shown in Figure 4.1.5.

Fig. 4.1.10 Maximum interpolation error for $f(x) = 1/(1 + x^2)$ on $[-5, +5]$, $[-2, +2]$, and
$[-1, +1]$ using $n + 1$ Chebyshev points

Instead of having the interpolation error increasing, it is decreasing at an expo- 
nential rate. As noted in Section 4.1.1.13, it is still better to reduce spacing before 
increasing n. 
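The contrast between the two families of points can be checked numerically. The following is a minimal sketch, not the method used to produce the figures: it estimates the Lebesgue constant by sampling $\sum_i |L_i(x)|$ on a uniform grid over $[0, 1]$. The function names `lebesgue_constant`, `equispaced`, and `chebyshev`, and the number of sample points, are choices made here for illustration.

```python
import math

def lebesgue_constant(nodes, a=0.0, b=1.0, samples=1001):
    """Estimate max over [a, b] of sum_i |L_i(x)| by sampling on a uniform grid."""
    m = len(nodes)
    best = 0.0
    for s in range(samples):
        x = a + (b - a) * s / (samples - 1)
        total = 0.0
        for i in range(m):
            # Lagrange polynomial L_i evaluated at x from its product formula
            li = 1.0
            for j in range(m):
                if j != i:
                    li *= (x - nodes[j]) / (nodes[i] - nodes[j])
            total += abs(li)
        best = max(best, total)
    return best

def equispaced(n):
    """Points x_i = i/n, i = 0, ..., n."""
    return [i / n for i in range(n + 1)]

def chebyshev(n, a=0.0, b=1.0):
    """Chebyshev points on [a, b] as given in the text."""
    return [(a + b) / 2 + (b - a) / 2 * math.cos((k + 0.5) * math.pi / (n + 1))
            for k in range(n + 1)]

for n in (5, 10, 20):
    print(n, lebesgue_constant(equispaced(n)), lebesgue_constant(chebyshev(n)))
```

Because the maximum is only sampled, the printed values are lower estimates of the true Lebesgue constants; a finer grid near the ends of the interval would sharpen the equally spaced values.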

Lebesgue constants are also very useful in dealing with multivariate interpolation, 
which will be discussed in the following sections. 


Exercises. 


(1) Let $f(x) = e^{x}/(1 + x^2)$. Compute the quadratic interpolant $p(x)$ (or some
representation of it) for this function using interpolation points $x_0 = 0$, $x_1 =
1/2$, and $x_2 = 1$. Plot the difference $f(x) - p(x)$ for $0 \le x \le 1$. For the plot,
compute the difference $f(x_k) - p(x_k)$ for $k = 0, 1, 2, \ldots, N$ for $x_k = k/N$
with $N = 100$.

(2) Show that if $x_0$, $x_1$, and $x_2$ are distinct, then $p(x) := x_0 L_0(x) + x_1 L_1(x) +
x_2 L_2(x) = x$ for all $x$. Here $L_i(x)$ are the quadratic Lagrange interpolation
polynomials for these interpolation points. [Hint: Let $f(x) = x$; then $p(x)$
is the quadratic interpolant of $f(x)$ at $x_0, x_1, x_2$. But $f$ is also a quadratic
interpolant of this data; by uniqueness of the quadratic interpolant, $f(x) = p(x)$
for all $x$.]

(3) Let $f(x) = e^{-x}/(1 + x^2)$. For $n = 1, 2, 3$ and for $h = 2^{-m}$, $m = 1, 2, \ldots, 8$,
compute the interpolating polynomial $p$ of $f(x_k)$ of degree $\le n$ where
$x_k = a + kh$, $k = 0, 1, \ldots, n$, and $a = 5$. Estimate the maximum interpolation
error for each pair $(n, h)$: $\max_{a \le x \le a + nh} |f(x) - p(x)|$. Estimate this
maximum error by evaluating $f(x) - p(x)$ at 101 points equally spaced in
the interval $[a, a + nh]$. Plot the maximum error against $h$ for each value of $n$.
You should use logarithmic scaling for both $h$ and the maximum error. Estimate
the slope of the plots. If $\epsilon_{n,m}$ is the maximum error for degree $n$ interpolation
and $h_m = 2^{-m}$, then we compute slope $= (\log(\epsilon_{n,\ell}) - \log(\epsilon_{n,m}))/(\log(h_\ell) -
\log(h_m))$ for suitable values of $\ell$ and $m$. These slopes should approximate the
order of the interpolation error, which should be $n + 1$. Report the estimated
slopes and comment on how close the slopes are to $n + 1$.

(4) Suppose that $x_k = a + \xi_k h$ and $y_\ell = b + \eta_\ell h$ for $k, \ell = 0, 1, 2, \ldots, n$. Show
how we can create an interpolating polynomial $p(x, y)$ of the data $p(x_k, y_\ell) =
z_{k,\ell}$ where the degree of $x \mapsto p(x, y)$ and the degree of $y \mapsto p(x, y)$ are both
$\le n$. Re-use the functions for divided differences in one variable to implement
your method. Compute the divided differences $D_{i,j}$ of $z_{i,j}$ with respect to $y$ for
each $i$, and then the divided differences $E_{i,j}$ of $D_{i,j}$ with respect to $x$ for
each $j$. To evaluate $p(x, y)$ we evaluate first

$$ D_j(x) = \sum_{i=0}^{n} E_{i,j} \prod_{k=0}^{i-1} (x - x_k), \qquad j = 0, 1, \ldots, n, \quad \text{and then} $$

$$ p(x, y) = \sum_{j=0}^{n} D_j(x) \prod_{\ell=0}^{j-1} (y - y_\ell). $$

How many arithmetic operations are needed to compute the divided differences,
and how many to evaluate $p(x, y)$ for each $(x, y)$? Test your method for
the function $f(x, y) = \exp(-x^2 - y) + \cos(7x)/(1 + x + y)$ for $a = b = 0$
with equally spaced interpolation points ($\xi_k = k$, $\eta_\ell = \ell$) for $n = 2$ and $h =
2^{-m}$ for $m = 1, 2, \ldots, 5$. What is the order of accuracy of this method?

(5) An alternative to the previous exercise is to use two-variable Lagrange
interpolation polynomials:

$$ L_{i,j}(x, y) = L_i(x)\, L_j(y), $$

where $L_i(x)$ is the Lagrange interpolation polynomial in $x$, and $L_j(y)$ is the
Lagrange interpolation polynomial in $y$. Use this to create an algorithm for
computing

$$ p(x, y) = \sum_{i=0}^{n} \sum_{j=0}^{n} z_{i,j}\, L_{i,j}(x, y). $$

How many arithmetic operations are needed to evaluate $p(x, y)$ for each $(x, y)$?
Test your code as described in the previous exercise.

(6) Extend Exercise 4 to functions of three variables. What are the problems of
using this interpolation method for functions of $d$ variables if $d$ is large?

(7) Repeat Exercise 3 using Hermite interpolation. This does mean you will have
to compute $f'(x)$ symbolically.

(8) The Runge phenomenon is explained in more detail in [86], which uses complex
analysis to better show the exponential growth of the error. If the interpolation
points are $x_0, x_1, \ldots, x_n \in [a, b]$ and $W_n(t) = \prod_{k=0}^{n} (t - x_k)$, show using the
Residue Theorem of complex analysis that for $t \in [a, b]$

$$ f(t) - p_n(t) = \frac{1}{2\pi i} \oint_C \frac{W_n(t)}{W_n(z)}\, \frac{f(z)}{z - t}\, dz, $$

where $C$ is a closed curve that goes once counter-clockwise around the interval
$[a, b]$. Also show that if the interpolation points are equally spaced, then
$(1/n) \ln |W_n(z)| \to (1/(b - a)) \int_a^b \ln |z - t|\, dt$ as $n \to \infty$ for $z \notin [a, b]$.

(9) The Chebyshev interpolation points (4.6.5) on the interval $(-1, +1)$, given
by $x_k = \cos((k + \frac12)\pi/(n + 1))$ for $k = 0, 1, 2, \ldots, n$, are the roots of the
Chebyshev polynomial (4.6.3) $T_{n+1}(\cos\theta) = \cos((n + 1)\theta)$. Derive the weights
$w_k$ for the barycentric interpolation Formula (4.1.14) for the Chebyshev points.
[Hint: Use the fact that $W(t) = \prod_{k=0}^{n} (t - x_k) = 2^{-n}\, T_{n+1}(t)$ and compute
$W'(x_j) = 2^{-n}\, T'_{n+1}(x_j)$ via the definition $T_{n+1}(\cos\theta) = \cos((n + 1)\theta)$.]

(10) Derive the weights $w_j$ for the barycentric interpolation formula for equally
spaced interpolation points $x_k = k/n$ for $k = 0, 1, 2, \ldots, n$. [Hint: Write
$W'(x_j)$ in terms of factorials.]

(11) Suppose that we have interpolation points $a \le x_0 < x_1 < \cdots < x_n \le b$ and
$c \le y_0 < y_1 < \cdots < y_n \le d$; then we can interpolate over the rectangle $[a, b] \times
[c, d]$ using $p(x_k, y_\ell) = f(x_k, y_\ell)$ for $k, \ell = 0, 1, 2, \ldots, n$ for a given function
$f$. Suppose $\Lambda_n$ is the Lebesgue constant for the one-dimensional interpolation
using $x_0, x_1, \ldots, x_n$, and $M_n$ the Lebesgue constant for the one-dimensional
interpolation using $y_0, y_1, \ldots, y_n$. Show that $\Lambda_n M_n$ is the Lebesgue constant
for the two-dimensional interpolation described above.

(12) Use the standard algorithm for computing divided differences in double
precision on equally spaced points $a = x_0 < x_1 < x_2 < \cdots < x_n = b$ for $f(x) =
\exp(-\sqrt{1 + x^2})$ on the interval $[a, b] = [-4, +4]$ with $n = 80$. Close
examination of the plot of the interpolant reveals erratic behavior near $x = \pm 4$ which
indicates roundoff error (see Figure 4.1.11). What do you get if you reverse the
order of the interpolation points? What if the interpolation points are randomly
ordered? Can you give a deterministic ordering of the interpolation points that
avoids problems with roundoff error?




4.2 Interpolation—Splines 


The Runge phenomenon shows how polynomial interpolation can fail. While the
distribution of interpolation points over the interpolation interval can reduce these
problems, it is not always possible to get data for these points. More reliable
interpolation methods are desirable. However, there are trade-offs: the rapid decay of the
maximum error if the interpolation interval is small enough no longer holds.

Fig. 4.1.11 Roundoff error in interpolation

Spline interpolation, at least in one dimension, is piecewise polynomial interpolation.
That is, given the interpolation points $x_0 < x_1 < x_2 < \cdots < x_n$ and data points
$(x_i, y_i)$, $i = 0, 1, 2, \ldots, n$, on each piece $[x_i, x_{i+1}]$ the interpolant $p(x)$ is a
polynomial. This can be done using standard polynomial interpolation by using a subset
$\{x_{i-k}, x_{i-k+1}, \ldots, x_{i+\ell}\}$ of neighboring interpolation points to give an interpolating
polynomial of degree $\le k + \ell$ that is used on $[x_i, x_{i+1}]$. The piecewise polynomial
created in this way is continuous: the polynomial used to interpolate on $[x_{i-1}, x_i]$ and
the polynomial used for $[x_i, x_{i+1}]$ both interpolate the same value $y_i$ at $x_i$. However,
the derivatives at $x_i$ usually do not match.

While piecewise polynomials with discontinuous derivatives can serve many purposes,
the needs of industries such as aerospace and vehicle manufacturing, and of
Computer-Aided Design (CAD) in general, motivated a search for a more general
interpolation method that gives better smoothness without the reliability issues of
standard polynomial interpolation.


4.2.1 Cubic Splines 


The simplest kind of splines are linear splines. A linear spline is simply piecewise
linear interpolation: given the points $(x_i, y_i)$ and $(x_{i+1}, y_{i+1})$ with $x_i < x_{i+1}$ we set
$\ell(x) = y_i L_{i,0}(x) + y_{i+1} L_{i,1}(x)$ for $x_i \le x \le x_{i+1}$ where $L_{i,0}(x) =
(x - x_{i+1})/(x_i - x_{i+1})$ and $L_{i,1}(x) = (x - x_i)/(x_{i+1} - x_i)$ are the corresponding
Lagrange interpolating polynomials. There is another way of looking at linear splines:
linear splines minimize

$$ \int_{x_0}^{x_n} \ell'(x)^2\, dx \quad \text{subject to } \ell(x_i) = y_i, \text{ for } i = 0, 1, 2, \ldots, n. \tag{4.2.1} $$

Linear splines $\ell$ are continuous, but the derivative $\ell'$ is generally discontinuous at
each interpolation point $x_i$. Standard polynomial interpolation can be applied on each
piece $[x_i, x_{i+1}]$; if $y_i = f(x_i)$, the polynomial interpolation error formula (4.1.7) can
be applied to give, for some $c_{i,x}$ in $[x_i, x_{i+1}]$:

$$ f(x) - \ell(x) = \frac{1}{2!} f''(c_{i,x})\, (x - x_i)(x - x_{i+1}) = O\big((x_{i+1} - x_i)^2\big). $$
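The second-order error of linear splines can be observed directly. The sketch below is an illustration under assumptions chosen here (the test function $\sin x$ on $[0, 1]$, equally spaced points, and the helper names `linear_spline` and `max_error`); it evaluates the piecewise linear interpolant with the two Lagrange polynomials $L_{i,0}$ and $L_{i,1}$ and compares the sampled maximum error for two spacings.

```python
import math

def linear_spline(x, y, t):
    """Evaluate the piecewise linear interpolant of (x[i], y[i]) at t."""
    i = 0
    # find the piece [x[i], x[i+1]] containing t (linear search for simplicity)
    while i < len(x) - 2 and t > x[i + 1]:
        i += 1
    L0 = (t - x[i + 1]) / (x[i] - x[i + 1])
    L1 = (t - x[i]) / (x[i + 1] - x[i])
    return y[i] * L0 + y[i + 1] * L1

def max_error(f, n, samples=1000):
    """Sampled maximum of |f - linear spline| on [0, 1] with n equal pieces."""
    x = [i / n for i in range(n + 1)]
    y = [f(xi) for xi in x]
    return max(abs(f(k / samples) - linear_spline(x, y, k / samples))
               for k in range(samples + 1))

e10 = max_error(math.sin, 10)
e20 = max_error(math.sin, 20)
print(e10, e20, e10 / e20)   # halving h should reduce the error by about 4
```

Since $|f''| \le 1$ for this test function, the error formula bounds the error by $h^2/8$, which gives a simple consistency check on the computed values.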


Cubic splines [71] are piecewise cubic functions $s$ that are continuous and have
continuous first and second derivatives. A cubic spline interpolant $s$ for data points
$(x_i, y_i)$, $i = 0, 1, 2, \ldots, n$, with $x_i < x_{i+1}$, is a cubic spline where $s(x)$ is a cubic
polynomial for $x_i \le x \le x_{i+1}$. We can represent $s(x) = s_i(x)$ for $x_i \le x \le x_{i+1}$ for
$i = 0, 1, \ldots, n - 1$, where $s_i(x)$ is a cubic polynomial. We can represent

$$ s_i(x) = a_i (x - x_i)^3 + b_i (x - x_i)^2 + c_i (x - x_i) + d_i. \tag{4.2.2} $$

The problem then is to determine the unknown coefficients $a_i, b_i, c_i, d_i$ for $i =
0, 1, 2, \ldots, n - 1$. This is a total of $4n$ unknowns to find. The equations to be satisfied
are

$$ s_i(x_i) = y_i, \quad s_i(x_{i+1}) = y_{i+1} \quad \text{for } i = 0, 1, 2, \ldots, n - 1, \tag{4.2.3} $$

$$ s_i'(x_{i+1}) = s_{i+1}'(x_{i+1}) \quad \text{for } i = 0, 1, 2, \ldots, n - 2, \tag{4.2.4} $$

$$ s_i''(x_{i+1}) = s_{i+1}''(x_{i+1}) \quad \text{for } i = 0, 1, 2, \ldots, n - 2. \tag{4.2.5} $$

Equation (4.2.3) represents the interpolation conditions. Note that $s_i(x_{i+1}) = y_{i+1} =
s_{i+1}(x_{i+1})$, so the interpolation conditions ensure that $s(x)$ is continuous. Equations
(4.2.4), (4.2.5) ensure continuity of the first and second derivatives at the "joints" $x_j$,
$j = 1, 2, \ldots, n - 1$, where adjacent pieces meet.

The total number of equations is therefore $2n + 2(n - 1) = 4n - 2$. That is, there
are two more unknowns than equations. We can specify a unique interpolant with
two additional conditions, provided the linear system is invertible. There are many
ways of adding two additional equations. The most commonly used are

• Natural spline: $s''(x_0) = s''(x_n) = 0$.
• Clamped spline: $s'(x_0) = y_0'$ and $s'(x_n) = y_n'$ for given $y_0'$ and $y_n'$.
• Not-a-knot spline: $s'''(x)$ continuous at $x = x_1$ and $x = x_{n-1}$.
• Periodic spline: provided $y_0 = y_n$, we set $s'(x_0) = s'(x_n)$ and $s''(x_0) = s''(x_n)$.


The natural spline is, in fact, the solution $s$ to the optimization problem

$$ \min_s \int_{x_0}^{x_n} s''(x)^2\, dx \quad \text{subject to } s(x_i) = y_i, \quad i = 0, 1, 2, \ldots, n, \tag{4.2.6} $$

where the minimum is taken over all $s$ where $s''$ is integrable. Clamped splines are the
minimizers of (4.2.6) with the additional constraints that $s'(x_0) = y_0'$ and $s'(x_n) = y_n'$.
Periodic splines are the minimizers of (4.2.6) with the additional constraints that $s$
is periodic with period $x_n - x_0$.


4.2.1.1 Computing Cubic Splines 


Each of the four kinds of splines identified above can be computed by solving a linear
system. The task here is to explicitly give a system that can be efficiently and accurately
solved. The approach taken here follows [13, 241]. Since the spline $s(x)$ is piecewise cubic
with continuous first and second derivatives, $s''(x)$ is continuous and piecewise
linear. Let $M_i = s''(x_i)$, $i = 0, 1, 2, \ldots, n$. Then for $x_i \le x \le x_{i+1}$ we have

$$ s''(x) = s_i''(x) = M_i\, \frac{x - x_{i+1}}{x_i - x_{i+1}} + M_{i+1}\, \frac{x - x_i}{x_{i+1} - x_i}, $$

using linear Lagrange interpolating polynomials. This choice of variables ensures
that $s''$ is continuous. Let $h_i = x_{i+1} - x_i$ for $i = 0, 1, 2, \ldots, n - 1$. Integrating twice
we get

$$ s_i(x) = \frac{M_i}{6h_i} (x_{i+1} - x)^3 + \frac{M_{i+1}}{6h_i} (x - x_i)^3 + C_i (x_{i+1} - x) + D_i (x - x_i) $$

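with two constants of integration $C_i$ and $D_i$. The interpolation Equation (4.2.3) for
$s_i$ becomes

$$ y_i = s_i(x_i) = \frac{M_i}{6h_i} h_i^3 + \frac{M_{i+1}}{6h_i}\, 0^3 + C_i h_i + D_i\, 0 = \frac{M_i}{6} h_i^2 + C_i h_i, $$

$$ y_{i+1} = s_i(x_{i+1}) = \frac{M_i}{6h_i}\, 0^3 + \frac{M_{i+1}}{6h_i} h_i^3 + C_i\, 0 + D_i h_i = \frac{M_{i+1}}{6} h_i^2 + D_i h_i. $$

These equations allow us to solve for $C_i$ and $D_i$:

$$ C_i = (y_i / h_i) - \tfrac16 M_i h_i, \qquad D_i = (y_{i+1} / h_i) - \tfrac16 M_{i+1} h_i. $$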

We still have to ensure continuity of the first derivative of $s(x)$: $s_{i-1}'(x_i) = s_i'(x_i)$ for
$i = 1, 2, \ldots, n - 1$. The derivative $s_i'(x)$ is

$$ s_i'(x) = -\frac{M_i}{2h_i} (x_{i+1} - x)^2 + \frac{M_{i+1}}{2h_i} (x - x_i)^2 + D_i - C_i $$
$$ \phantom{s_i'(x)} = -\frac{M_i}{2h_i} (x_{i+1} - x)^2 + \frac{M_{i+1}}{2h_i} (x - x_i)^2 + \frac{y_{i+1} - y_i}{h_i} - \frac16 (M_{i+1} - M_i) h_i. $$

Thus

$$ s_i'(x_i) = -\frac{M_i}{2h_i} h_i^2 + \frac{y_{i+1} - y_i}{h_i} - \frac16 (M_{i+1} - M_i) h_i, \quad \text{and} $$
$$ s_{i-1}'(x_i) = \frac{M_i}{2h_{i-1}} h_{i-1}^2 + \frac{y_i - y_{i-1}}{h_{i-1}} - \frac16 (M_i - M_{i-1}) h_{i-1}. $$

Equating these derivatives gives

$$ \frac16 h_{i-1} M_{i-1} + \frac13 (h_{i-1} + h_i) M_i + \frac16 h_i M_{i+1} = \frac{y_{i+1} - y_i}{h_i} - \frac{y_i - y_{i-1}}{h_{i-1}} \tag{4.2.7} $$

for $i = 1, \ldots, n - 1$. We are still missing the two additional equations, which depend
on the type of cubic spline.

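• Natural spline: $M_0 = M_n = 0$.

• Clamped spline: $s_0'(x_0) = y_0'$ and $s_{n-1}'(x_n) = y_n'$, so

$$ y_0' = \frac{y_1 - y_0}{h_0} - \frac12 M_0 h_0 - \frac16 (M_1 - M_0) h_0 = \frac{y_1 - y_0}{h_0} - h_0 \left( \frac13 M_0 + \frac16 M_1 \right), $$
$$ y_n' = \frac{y_n - y_{n-1}}{h_{n-1}} + \frac12 M_n h_{n-1} - \frac16 (M_n - M_{n-1}) h_{n-1} = \frac{y_n - y_{n-1}}{h_{n-1}} + h_{n-1} \left( \frac16 M_{n-1} + \frac13 M_n \right). $$

• Not-a-knot spline: $(M_1 - M_0)/h_0 = (M_2 - M_1)/h_1$ and $(M_{n-1} - M_{n-2})/h_{n-2} =
(M_n - M_{n-1})/h_{n-1}$.

• Periodic spline: $M_0 = M_n$ and

$$ \frac{y_1 - y_0}{h_0} - \frac{y_n - y_{n-1}}{h_{n-1}} = \frac16 h_{n-1} M_{n-1} + \frac13 (h_{n-1} + h_0) M_0 + \frac16 h_0 M_1, $$

assuming $y_n = y_0$.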


The equations to solve are tridiagonal (see Section 2.3.1) for the natural and clamped 
splines. The equations to solve for the not-a-knot and periodic splines are rank-2 
modifications of tridiagonal matrices for which we can apply the Sherman—Morrison 
(2.1.16) or Sherman—Morrison—Woodbury (2.1.17) formulas. 
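For the natural spline, the equations (4.2.7) with $M_0 = M_n = 0$ can be solved directly by tridiagonal elimination, and the spline evaluated from the formula for $s_i(x)$ with the constants $C_i$ and $D_i$. The following is a minimal sketch under those assumptions; the function names `natural_spline_moments` and `spline_eval`, and the linear search for the interval, are illustrative choices rather than the book's implementation.

```python
import math

def natural_spline_moments(x, y):
    """Solve (4.2.7) with M_0 = M_n = 0 for the moments M_i = s''(x_i)."""
    n = len(x) - 1
    h = [x[i + 1] - x[i] for i in range(n)]
    M = [0.0] * (n + 1)
    if n < 2:
        return M
    # rows i = 1, ..., n-1 of the tridiagonal system (stored with index i-1)
    sub = [h[i - 1] / 6 for i in range(1, n)]
    diag = [(h[i - 1] + h[i]) / 3 for i in range(1, n)]
    sup = [h[i] / 6 for i in range(1, n)]
    rhs = [(y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1]
           for i in range(1, n)]
    m = n - 1
    # forward elimination without pivoting (the matrix is diagonally dominant)
    for k in range(1, m):
        w = sub[k] / diag[k - 1]
        diag[k] -= w * sup[k - 1]
        rhs[k] -= w * rhs[k - 1]
    # back substitution
    sol = [0.0] * m
    sol[m - 1] = rhs[m - 1] / diag[m - 1]
    for k in range(m - 2, -1, -1):
        sol[k] = (rhs[k] - sup[k] * sol[k + 1]) / diag[k]
    M[1:n] = sol
    return M

def spline_eval(x, y, M, t):
    """Evaluate s(t) using s_i(x) with constants C_i and D_i."""
    n = len(x) - 1
    i = 0
    while i < n - 1 and t > x[i + 1]:
        i += 1
    h = x[i + 1] - x[i]
    C = y[i] / h - M[i] * h / 6
    D = y[i + 1] / h - M[i + 1] * h / 6
    return ((M[i] * (x[i + 1] - t) ** 3 + M[i + 1] * (t - x[i]) ** 3) / (6 * h)
            + C * (x[i + 1] - t) + D * (t - x[i]))

# usage: interpolate sin on [0, pi], where s'' = 0 at both ends matches f''
xs = [math.pi * i / 10 for i in range(11)]
ys = [math.sin(v) for v in xs]
Ms = natural_spline_moments(xs, ys)
print(spline_eval(xs, ys, Ms, 1.234), math.sin(1.234))
```

The clamped case would add the two boundary rows to the same tridiagonal structure; the not-a-knot and periodic cases need the low-rank corrections mentioned above and are not covered by this sketch.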

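To see that the linear systems are invertible we start with the natural spline
equations:

$$
\begin{pmatrix}
\frac13 (h_0 + h_1) & \frac16 h_1 & & & \\
\frac16 h_1 & \frac13 (h_1 + h_2) & \frac16 h_2 & & \\
 & \frac16 h_2 & \frac13 (h_2 + h_3) & \ddots & \\
 & & \ddots & \ddots & \frac16 h_{n-2} \\
 & & & \frac16 h_{n-2} & \frac13 (h_{n-2} + h_{n-1})
\end{pmatrix}
\begin{pmatrix} M_1 \\ M_2 \\ M_3 \\ \vdots \\ M_{n-1} \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_{n-1} \end{pmatrix}.
$$

This matrix is symmetric and also diagonally dominant as $\frac13 (h_k + h_{k+1}) > \frac16 h_k +
\frac16 h_{k+1}$. Since the matrix is diagonally dominant, it is invertible, and LU factorization
does not need partial pivoting (see 2.1.4.4).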


4.2.1.2 Error Estimates for Cubic Splines 


We can get optimal $O(h^4)$ error estimates for clamped cubic spline interpolation.
In fact, we can also get error estimates for the $k$th derivatives up to $k = 3$ of order
$O(h^{4-k})$, which is also optimal. There is a small catch, which is that we need a bound
on $\max_i \max(h_{i-1}/h_i, h_i/h_{i-1})$ and on $\max_i h_i / \min_i h_i$. Note that similar results
can be obtained for the "not-a-knot" and periodic splines. Natural splines still have
good behavior, thanks to the rapid decay of errors from the endpoints to the interior.


Theorem 4.8 If $f$ has continuous fourth-order derivative and $s$ is the clamped
cubic spline interpolant of $s(x_i) = f(x_i)$, $i = 0, 1, 2, \ldots, n$ with $a = x_0 < x_1 <
\cdots < x_{n-1} < x_n = b$ then

$$ \max_{a \le x \le b} |f(x) - s(x)| = O(L\, K^{1/2} h^4), \quad \text{and} $$

$$ \max_{a \le x \le b} |f^{(j)}(x) - s^{(j)}(x)| = O(L\, K^{1/2} h^{4-j}), \quad \text{for } j = 1, 2, 3, $$

as $h \to 0$ where $L = \max_{a \le x \le b} |f^{(4)}(x)|$, $K = \max_i \max(h_{i-1}/h_i, h_i/h_{i-1})$, and
$h = \max_i h_i$. The hidden constants in the "$O$" are absolute constants except for
$j = 3$ where there is an additional factor of $h / \min_i h_i$.


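Proof Let us assume that $y_i = f(x_i)$ and $y_i' = f'(x_i)$, etc. To get the error
estimates started, we need to show that $M_i = f''(x_i) + O(h^2) = y_i'' + O(h^2)$ with
$h = \max_i h_i$. The starting point for this is the linear system we have generated.
Let us write the clamped cubic spline equations (4.2.7) as $AM = b$:

$$
\begin{pmatrix}
\frac13 h_0 & \frac16 h_0 & & & & \\
\frac16 h_0 & \frac13 (h_0 + h_1) & \frac16 h_1 & & & \\
 & \frac16 h_1 & \frac13 (h_1 + h_2) & \ddots & & \\
 & & \ddots & \ddots & \frac16 h_{n-2} & \\
 & & & \frac16 h_{n-2} & \frac13 (h_{n-2} + h_{n-1}) & \frac16 h_{n-1} \\
 & & & & \frac16 h_{n-1} & \frac13 h_{n-1}
\end{pmatrix}
\begin{pmatrix} M_0 \\ M_1 \\ M_2 \\ \vdots \\ M_{n-1} \\ M_n \end{pmatrix}
=
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_{n-1} \\ b_n \end{pmatrix}
\tag{4.2.8}
$$

where $b_i = (y_{i+1} - y_i)/h_i - (y_i - y_{i-1})/h_{i-1}$ for $i = 1, 2, \ldots, n - 1$, $b_0 = (y_1 -
y_0)/h_0 - y_0'$ and $b_n = y_n' - (y_n - y_{n-1})/h_{n-1}$.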

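The first step is to show that the inverse of a scaled version of our matrix is bounded
independently of $h > 0$. If we multiply (4.2.7) by $1/\sqrt{h_{i-1} h_i}$ we get

$$ \frac16 \sqrt{\frac{h_{i-1}}{h_i}}\, M_{i-1} + \frac13 \left[ \sqrt{\frac{h_{i-1}}{h_i}} + \sqrt{\frac{h_i}{h_{i-1}}} \right] M_i + \frac16 \sqrt{\frac{h_i}{h_{i-1}}}\, M_{i+1} = \frac{b_i}{\sqrt{h_{i-1} h_i}} $$

for $i = 1, 2, \ldots, n - 1$. For the first and last rows of (4.2.8), we can multiply by
$1/h_0$ and $1/h_{n-1}$, respectively. The scaled matrix $A'$ is diagonally dominant and the
diagonal entries satisfy

$$ \frac13 \left[ \sqrt{\frac{h_{i-1}}{h_i}} + \sqrt{\frac{h_i}{h_{i-1}}} \right] \ge \frac23. $$

Write $A' = D' - B$ where $D'$ is the diagonal of $A'$ so that

$$ (A')^{-1} = (D' - B)^{-1} = \left( D' \left[ I - (D')^{-1} B \right] \right)^{-1} = \left[ I - (D')^{-1} B \right]^{-1} (D')^{-1} $$
$$ \phantom{(A')^{-1}} = \left[ I + (D')^{-1} B + \left( (D')^{-1} B \right)^2 + \left( (D')^{-1} B \right)^3 + \cdots \right] (D')^{-1}. $$

But $\| (D')^{-1} B \|_\infty \le 1/2$, and so $\| (A')^{-1} \|_\infty \le 2\, \| (D')^{-1} \|_\infty \le 3$.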
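Now if we let $F_i = f''(x_i) = y_i''$, then to estimate $\| M - F \|_\infty$, we first compute
$\| A(M - F) \|_\infty$. Estimating $\| A(M - F) \|_\infty$ is an exercise in Taylor series. Writing
$f^{(k)}(x_i)$ as $y_i^{(k)}$, we see that we can write the Taylor series as

$$ y_{i+1} = y_i + h_i y_i' + \tfrac12 h_i^2 y_i'' + \tfrac16 h_i^3 y_i''' + \tfrac1{24} h_i^4 \eta_i, $$
$$ y_{i-1} = y_i - h_{i-1} y_i' + \tfrac12 h_{i-1}^2 y_i'' - \tfrac16 h_{i-1}^3 y_i''' + \tfrac1{24} h_{i-1}^4 \tilde\eta_i, $$
$$ y_{i+1}'' = y_i'' + h_i y_i''' + \tfrac12 h_i^2 \mu_i, $$
$$ y_{i-1}'' = y_i'' - h_{i-1} y_i''' + \tfrac12 h_{i-1}^2 \tilde\mu_i, $$

where $\eta_i$, $\tilde\eta_i$, $\mu_i$, $\tilde\mu_i$ are fourth derivatives of $f$ evaluated at certain points in the
interval $(x_{i-1}, x_{i+1})$. Then noting that $\gamma := A(M - F) = AM - AF = b - AF$,
we have for $i = 1, 2, \ldots, n - 1$,

$$ \gamma_i = \frac{y_{i+1} - y_i}{h_i} - \frac{y_i - y_{i-1}}{h_{i-1}} - \left[ \frac{h_{i-1}}{6} y_{i-1}'' + \frac{h_{i-1} + h_i}{3} y_i'' + \frac{h_i}{6} y_{i+1}'' \right] $$
$$ \phantom{\gamma_i} = \tfrac12 h_i y_i'' + \tfrac16 h_i^2 y_i''' + \tfrac1{24} h_i^3 \eta_i + \tfrac12 h_{i-1} y_i'' - \tfrac16 h_{i-1}^2 y_i''' + \tfrac1{24} h_{i-1}^3 \tilde\eta_i - \left[ \frac{h_{i-1}}{6} y_{i-1}'' + \frac{h_{i-1} + h_i}{3} y_i'' + \frac{h_i}{6} y_{i+1}'' \right] $$
$$ \phantom{\gamma_i} = \tfrac12 (h_i + h_{i-1}) y_i'' + \tfrac16 (h_i^2 - h_{i-1}^2) y_i''' + O\big(L (h_i^3 + h_{i-1}^3)\big) - \tfrac12 (h_i + h_{i-1}) y_i'' - \tfrac16 (h_i^2 - h_{i-1}^2) y_i''' + O\big(L (h_i^3 + h_{i-1}^3)\big) $$
$$ \phantom{\gamma_i} = O\big(L (h_i^3 + h_{i-1}^3)\big), \tag{4.2.9} $$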


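where $L := \max_x |f^{(4)}(x)|$. After scaling from $A$ to $A'$, we change the linear system
to $A'(M - F) = \gamma'$ with

$$ |\gamma_i'| \le \text{const}\, L\, \frac{h_{i-1}^3 + h_i^3}{\sqrt{h_{i-1} h_i}} \le \text{const}\, L\, (h_{i-1}^2 + h_i^2) \max\left( \sqrt{\frac{h_{i-1}}{h_i}},\ \sqrt{\frac{h_i}{h_{i-1}}} \right) $$

for all $i$. Then using the bound on $\| (A')^{-1} \|_\infty$, we obtain

$$ \| M - F \|_\infty \le \text{const}\, L\, h^2 K^{1/2} $$

where $K = \max_i \max(h_{i-1}/h_i, h_i/h_{i-1})$. That is, $M_i - f''(x_i) = O(L\, K^{1/2} h^2)$.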

Since $f^{(4)}$ is continuous and $s''(x)$ is the piecewise linear interpolant of $M_i \approx
f''(x_i)$, the error in $s''(x)$ can be estimated by the error in piecewise linear interpolation of
$f''(x)$, which is $O(L\, h^2)$, plus the error due to the approximation $M_i \approx f''(x_i)$. Thus
the error in $s''(x)$ is $O(L\, K^{1/2} h^2)$.

The error in $s'''(x) = (M_{i+1} - M_i)/h_i$ for $x_i < x < x_{i+1}$ can be estimated by the
error in $f'''(x) \approx (f''(x_{i+1}) - f''(x_i))/h_i$ plus bounds for the errors in $M_i \approx f''(x_i)$
divided by $h_i$, which gives a bound of $O(L\, K^{1/2} h\, (h / \min_i h_i))$.

Since $s(x_i) = f(x_i)$ for all $i$, by Rolle's theorem, there is a point $\xi_i \in (x_i, x_{i+1})$
where $s'(\xi_i) = f'(\xi_i)$. Integrating from $\xi_i$ over the remainder of the interval $[x_i, x_{i+1}]$
using the error bound for $s''$ shows that the error in $s'$ is only $O(L\, K^{1/2} h^3)$. Finally,
integrating this error from $x_i$ to $x \in [x_i, x_{i+1}]$ shows that the error in $s(x)$ is
$O(L\, K^{1/2} h^4)$.


It should be noted that this proof indicates that it is possible to accurately
interpolate less smooth functions by adapting the spacing $h_i = x_{i+1} - x_i$ according
to the size of $f^{(4)}(x)$ for $x_i \le x \le x_{i+1}$. The estimate of $\gamma_i$ (4.2.9), where
$A(M - F) = \gamma$, can be controlled by $O(\max_{x_{i-1} \le x \le x_{i+1}} |f^{(4)}(x)|\, (h_i^3 + h_{i-1}^3))$. Scaling
by $1/\sqrt{h_i h_{i-1}}$ increases this bound on $\gamma_i$ to $O(\max_{x_{i-1} \le x \le x_{i+1}} |f^{(4)}(x)|\, K^{1/2}
(h_i^2 + h_{i-1}^2))$. However, choosing $h_i \approx h^*\, |f^{(4)}(x_i)|^{-1/2}$ gives $\gamma_i = O(K^{1/2} (h^*)^2)$.
The size of $K$ can be controlled by making sure that the spacing does not change
dramatically between adjacent interpolation points. The error $\| M - F \|_\infty$ in $M_i \approx
F_i = f''(x_i)$ is then $O((h^*)^2)$, and corresponding error bounds can be found for $s''(x)$
and the lower derivatives of $s$.

One of the advantages of cubic spline interpolation is that the error due to the
perturbation of one data point decays exponentially rapidly as you move away from
the perturbed data point. Note that the decay rate is independent of $h$ as a rate per
interpolation point. Figure 4.2.1 shows not-a-knot cubic spline interpolants for the
data points $(x_i, y_i)$, $i = 0, 1, 2, \ldots, n$, with $x_i = ih$ and $y_i = 0$ except for $y_{n/2} = 1$
(taking $n$ even). The perturbation therefore occurs at $x = 1/2$. The claim that the
perturbations decay at a roughly constant rate per interpolation point can be seen in
Figure 4.2.1(b).

Fig. 4.2.1 Decay of perturbations for equally spaced cubic spline interpolation: (a) not-a-knot
splines, $n = 4, 8, 16, 32$; (b) zoomed not-a-knot splines, $n = 8, 10, 12, 14, 16$

The rate of decay can be determined for constant $h$ by using (4.2.7) with zero
right-hand side:

$$ \tfrac16 h M_{i-1} + \tfrac13 (h + h) M_i + \tfrac16 h M_{i+1} = 0, \quad \text{so} \quad \tfrac16 M_{i-1} + \tfrac23 M_i + \tfrac16 M_{i+1} = 0. $$

Solving this linear recurrence relation we get $M_i = a_1 r_1^i + a_2 r_2^i$ where $r_1$ and $r_2$
are the roots of the characteristic equation $\frac16 + \frac23 r + \frac16 r^2 = 0$; these roots are $-2 \pm
\sqrt3$. Note that $r_1 \cdot r_2 = 1$, so we can write $M_i = a_1 r_1^i + a_2 r_1^{-i}$. Around a perturbed
interpolation value, the perturbations decay with a factor of roughly $|-2 + \sqrt3| \approx
0.268$ per interpolation point.
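The predicted decay factor can be compared with the moments $M_i$ obtained by actually solving the spline equations for impulse data. The sketch below is an illustration only: it assumes unit spacing $x_i = i$ and natural end conditions ($M_0 = M_n = 0$) rather than the not-a-knot conditions used for Figure 4.2.1, and the function name `impulse_moments` is chosen here.

```python
import math

def impulse_moments(n):
    """Moments M_i for x_i = i and y_i = 0 except y_{n/2} = 1 (n even),
    with natural end conditions M_0 = M_n = 0."""
    y = [0.0] * (n + 1)
    y[n // 2] = 1.0
    m = n - 1                      # unknowns M_1, ..., M_{n-1}
    off = 1.0 / 6.0                # sub- and super-diagonal entries for h = 1
    diag = [2.0 / 3.0] * m
    rhs = [y[i + 1] - 2.0 * y[i] + y[i - 1] for i in range(1, n)]
    # tridiagonal elimination without pivoting
    for k in range(1, m):
        w = off / diag[k - 1]
        diag[k] -= w * off
        rhs[k] -= w * rhs[k - 1]
    M = [0.0] * (n + 1)
    M[m] = rhs[m - 1] / diag[m - 1]
    for k in range(m - 2, -1, -1):
        M[k + 1] = (rhs[k] - off * M[k + 2]) / diag[k]
    return M

M = impulse_moments(40)
ratio = M[26] / M[25]              # a few points to the right of the impulse
print(ratio, -2.0 + math.sqrt(3.0))
```

Away from both the impulse and the ends, successive moments should alternate in sign and shrink by a factor close to $2 - \sqrt3$, in line with the root $r_1 = -2 + \sqrt3$ of the characteristic equation.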

The reliability of spline interpolation can be confirmed by estimation of the
Lebesgue constant for equally spaced not-a-knot spline interpolation. The functions
$L_i(x)$ are the interpolants of $L_i(x_j) = 1$ if $i = j$ and $L_i(x_j) = 0$ if $i \ne j$. These
Lebesgue constants are shown in Figure 4.2.2.

As is apparent in Figure 4.2.2, the Lebesgue constants do not grow rapidly with $n$. In
fact, they appear to be bounded with a bound a little less than two. Using $n = 1000$,
the Lebesgue constant computed for not-a-knot splines is about 1.965. This combined
with the exponential decay of perturbations makes cubic spline interpolation very
robust.


Fig. 4.2.2 Lebesgue constants for not-a-knot spline interpolation with $n \ge 4$ and equally spaced
points

Fig. 4.2.3 Overshoot of not-a-knot spline interpolation, $n = 9, 19, 39$


Notwithstanding the excellent character of cubic spline interpolation, it still has
some oscillation when interpolating discontinuous data. For example, for a step
function interpolated by not-a-knot cubic splines, we obtain the results shown in
Figure 4.2.3 for $n = 9, 19, 39$. As we can see, the overshoot does not go to zero as
$n \to \infty$.

4.2.2 Higher Order Splines in One Variable


There is a natural generalization of cubic splines in one variable based on the
minimization principles (4.2.1) for piecewise linear interpolation and (4.2.6) for cubic
splines. We can ask to find

$$ \min_s \int_{x_0}^{x_n} s'''(x)^2\, dx \quad \text{subject to } s(x_i) = y_i, \quad i = 0, 1, 2, \ldots, n. \tag{4.2.10} $$

The minimizer $s$ of (4.2.10) is a piecewise quintic (fifth order) polynomial that has
continuous fourth derivatives (not just third derivatives). Combining these conditions
with the interpolation property gives $6n - 4$ equations for $6n$ unknown coefficients
of $s(x) = a_i x^5 + b_i x^4 + c_i x^3 + d_i x^2 + e_i x + f_i$ for $x_i \le x \le x_{i+1}$. Which four extra
conditions are imposed determines the type of quintic spline. Natural quintic splines
have the properties that $s'''(x) = s^{(4)}(x) = 0$ at $x = x_0, x_n$; clamped quintic splines
have the properties that $s'(x_0) = y_0'$, $s''(x_0) = y_0''$, $s'(x_n) = y_n'$, and $s''(x_n) = y_n''$ are
all specified. Computing one of these quintic splines given the interpolation data
involves solving a penta-diagonal (5-diagonal) symmetric positive-definite linear
system.

Quintic splines have the advantage of slightly better smoothness and less oscilla- 
tion in the interpolants. They have the disadvantage of greater computational cost to 
implement. 


Exercises. 


(1) Use not-a-knot cubic splines to interpolate $f(x) = e^{-x}/(1 + x)$ over $[0, 1]$ using
$n + 1$ equally spaced interpolation points with $n = 5, 10, 20, 40, 100$. Estimate
the maximum error between $f$ and the spline interpolants using 1001 points
equally spaced over $[0, 1]$. Plot the maximum error against $n$. Estimate the
exponent $\alpha$ where the maximum error is asymptotically $C h^\alpha$. Does this confirm the
theoretical error estimate of $O(h^4)$?

(2) Numerically estimate the Lebesgue constant of not-a-knot spline interpolation by
finding $\max_{0 \le x \le 1} \sum_{k=0}^{n} |\ell_k(x)|$ where $\ell_k$ is the not-a-knot spline function
interpolating $\ell_k(x_j) = 1$ if $j = k$ and zero if $j \ne k$. Use equally spaced interpolation
points $x_j = j/n$ for $j = 0, 1, 2, \ldots, n$. Do this for $n = 5, 10, 20, 40, 100$.

(3) To see the exponential decay of perturbations, compute the not-a-knot spline
interpolant of the data $y_j = 0$ for $0 \le j \le N$ except that $y_{N/2} = 1$ assuming $N$
even; also set $x_j = j$, $j = 0, 1, 2, \ldots, N$. Do this for $N = 100$. Estimate the
exponential rate of decay of the spline interpolant $s(x_j)$ as $|j - N/2|$ increases.
Repeat this with $x_j = j/N$.
(4) Compute not-a-knot, clamped, and natural spline interpolants of $f(x) =
e^{-x}/(1 + x)$ over $[0, 1]$ using $n + 1$ equally spaced interpolation points with $n =
5, 10, 20, 40, 100$. Plot the errors for the different interpolants. Which has
smallest maximum error? Where are the differences between the different
versions of cubic spline interpolants?

(5) Show that natural splines minimize $\int_{x_0}^{x_n} [s''(x)]^2\, dx$ subject to $s(x_j) = y_j$ for
$j = 0, 1, 2, \ldots, n$. [Hint: First show that between consecutive interpolation
points, $s^{(4)}(x) = 0$, so the function must be piecewise cubic. Do this by using

$$ \int_{x_0}^{x_n} [s''(x)]^2\, dx \le \int_{x_0}^{x_n} [s''(x) + t\, \eta''(x)]^2\, dx $$

for all $t \in \mathbb{R}$ and $\eta(x)$ smooth and $\eta(x_j) = 0$ for all $j$. Use integration by parts
on each piece $[x_j, x_{j+1}]$ to put the derivatives on $s$.]

(6) Show that clamped splines minimize $\int_{x_0}^{x_n} [s''(x)]^2\, dx$ subject to $s(x_j) = y_j$ for
$j = 0, 1, 2, \ldots, n$, and the clamping conditions $s'(x_0) = y_0'$ and $s'(x_n) = y_n'$.

(7) Use equally spaced interpolation points and not-a-knot splines to interpolate
$f(x) = \sqrt{x}$ on $[0, 1]$. Use $n + 1$ interpolation points for $n = 5, 10, 20, 40, 100$.
Plot the maximum error against the spacing $h = 1/n$ in a log-log plot. What
relationship do you see between $h$ and the maximum error?

(8) For the task of Exercise 7, instead of using equally spaced interpolation points,
we can try using a graded mesh to deal with the singularity of $\sqrt{x}$ at $x =
0$: use interpolation points $x_k = k^\gamma / N^\gamma$ for $k = 0, 1, 2, \ldots, N$. Use $\gamma =
1, 1.5, 2, 2.5, 3$ and $N = 2^\ell$, $\ell = 1, 2, \ldots, 10$. Plot the maximum error against
$N$ for each value of $\gamma$ used. Estimate the exponents $\alpha$ where the maximum error
appears to be $\approx \text{constant}\; N^{-\alpha}$ for each value of $\gamma$ used.

(9) Do two-dimensional spline interpolation as follows: Given interpolation points
$x_i$, $i = 0, 1, \ldots, M$, in the $x$-axis and $y_j$, $j = 0, 1, 2, \ldots, N$, and values to
interpolate $z_{ij}$, we want a function $s$ with $s(x_i, y_j) = z_{ij}$. For each $j$, find a vector
representation $r_j$ of the splines along the $x$-axis with $y = y_j$. There must
be a universal function $\text{spline}(x; r, \mathbf{x})$ that computes the value of the spline
represented by $r$ at the point $x$ with interpolation points $\mathbf{x}$. Apply spline
interpolation to create the vector-valued spline function $r(y)$ where $r(y_j) = r_j$.
Then $s(x, y) = \text{spline}(x; r(y), \mathbf{x})$. Implement this approach in your favorite
programming language. Test your implementation for interpolating the function
$f(x, y) = \exp(-x^2 - xy - \tfrac12 y^2)$ over the rectangle $[-2, +2] \times [-2, +2]$
with $M = N = 2^\ell$, $\ell = 2, 3, \ldots, 6$. [Note: It is important that your representation
is a linear one. That is, $\text{spline}(x; \alpha_1 r_1 + \alpha_2 r_2, \mathbf{x}) = \alpha_1\, \text{spline}(x; r_1, \mathbf{x}) +
\alpha_2\, \text{spline}(x; r_2, \mathbf{x})$ for any $\alpha_1$, $\alpha_2$, and $r_1$, $r_2$.]

(10) Show that fifth-order natural spline interpolants for interpolation points $a =
x_0 < x_1 < \cdots < x_{N-1} < x_N = b$ can be defined by minimizing $\int_a^b [s'''(x)]^2\, dx$
subject to the condition that $s(x_i) = y_i$ for $i = 0, 1, 2, \ldots, N$. Show that these
fifth-order spline functions are piecewise polynomials of degree $\le 5$, and are
continuous with continuous derivatives up to the fourth order.

4.3 Interpolation—Triangles and Triangulations


4.3.1 Interpolation over Triangles 


In one dimension, the basic shapes are usually very simple: intervals. In two dimen- 
sions, there is a much greater choice, and in three dimensions, the set of basic shapes 
is even larger. 

In two dimensions, we focus on triangles. Polygons can be decomposed into 
triangles. Domains with curved boundaries can be approximated by unions of non- 
overlapping triangles. The triangles in the union should be “non-overlapping” at least 
in the sense that intersections of different triangles in the union are either vertices or 
edges. In three dimensions, we focus on tetrahedra, and simplices in four and higher 
dimensions, where similar methods and behavior apply. 

For many calculations, it is convenient to use barycentric coordinates to represent
points in a triangle:

$$ x = \lambda_1 v_1 + \lambda_2 v_2 + \lambda_3 v_3 \quad \text{where } 0 \le \lambda_1, \lambda_2, \lambda_3 \text{ and } \lambda_1 + \lambda_2 + \lambda_3 = 1, \tag{4.3.1} $$

and $v_1, v_2, v_3$ are the vertices of the triangle. Note that the triangle $T$ with these
vertices is the set of all convex combinations of $v_1, v_2, v_3$; we write $T = \text{co}\{v_1, v_2, v_3\}$.
Furthermore, each point in $T$ can be represented uniquely in the form of (4.3.1). The
barycentric coordinates for a given point $x \in T$ are written as $\lambda_1(x), \lambda_2(x), \lambda_3(x)$.
Note that the vector of barycentric coordinates $\lambda(x)$ is an affine function of $x$:
$\lambda(x) = Ax + b$ for some matrix $A$ and vector $b$.

Given a function $f: T \to \mathbb{R}$ we have the linear interpolant given by

$$ p_1(x) = f(v_1)\, \lambda_1(x) + f(v_2)\, \lambda_2(x) + f(v_3)\, \lambda_3(x). \tag{4.3.2} $$


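Computing the barycentric coordinates can be done using some linear algebra:
since $x = [v_1 \mid v_2 \mid v_3]\, \lambda$ and $1 = [1 \mid 1 \mid 1]\, \lambda$ we can combine them into

$$ \begin{bmatrix} 1 \\ x \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 \\ v_1 & v_2 & v_3 \end{bmatrix} \lambda, \quad \text{so} \quad \lambda = \begin{bmatrix} 1 & 1 & 1 \\ v_1 & v_2 & v_3 \end{bmatrix}^{-1} \begin{bmatrix} 1 \\ x \end{bmatrix}. $$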
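As a concrete illustration of this $3 \times 3$ solve, the sketch below computes the barycentric coordinates by Cramer's rule and uses them for the linear interpolant of a function on a triangle. The function names `barycentric` and `linear_interpolant`, and the representation of points as pairs of floats, are assumptions made here.

```python
def barycentric(v1, v2, v3, x):
    """Solve [1 1 1; v1 v2 v3] lam = [1; x] by Cramer's rule (2D points)."""
    def det(p, q, r):
        # determinant of the 3x3 matrix with columns (1, p), (1, q), (1, r)
        return (q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])
    d = det(v1, v2, v3)            # nonzero when the triangle is not degenerate
    return (det(x, v2, v3) / d, det(v1, x, v3) / d, det(v1, v2, x) / d)

def linear_interpolant(f, v1, v2, v3, x):
    """Linear interpolant of f over the triangle, evaluated at x."""
    l1, l2, l3 = barycentric(v1, v2, v3, x)
    return f(v1) * l1 + f(v2) * l2 + f(v3) * l3

tri = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
print(barycentric(*tri, (0.25, 0.5)))
print(linear_interpolant(lambda p: 1 + 2 * p[0] - 3 * p[1], *tri, (0.25, 0.5)))
```

For an affine function the interpolant reproduces the function exactly, which gives a quick check of an implementation.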

Every linear polynomial in two variables $(x, y)$ has the form $a_1 + a_2 x + a_3 y$; the
space of linear polynomials in two variables has dimension three.

The space of quadratic polynomials in two variables has dimension six: $a_1 +
a_2 x + a_3 y + a_4 x^2 + a_5 xy + a_6 y^2$. So a quadratic interpolation method will require
six data values. These are typically taken to be the points shown in the middle triangle
of Figure 4.3.1. In general, if $\mathcal{P}_{k,d}$ is the space of polynomials of degree $\le k$ in $d$
variables, then

$$ \dim \mathcal{P}_{k,d} = \binom{k+d}{d} = \binom{k+d}{k}. \tag{4.3.3} $$

Fig. 4.3.1 Triangles showing Lagrange interpolation points

For quadratic interpolation over a triangle, we use the vertices $v_i$ and the midpoints
$v_{ij} = (v_i + v_j)/2$ for the evaluation points. Then the quadratic interpolant can be
expressed in terms of barycentric coordinates as

$$ p_2(x) = \sum_{i=1}^{3} f(v_i)\, \lambda_i (2\lambda_i - 1) + \sum_{i,j=1;\ i<j}^{3} f(v_{ij})\, 4 \lambda_i \lambda_j. \tag{4.3.4} $$
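Formula (4.3.4) translates directly into code once the barycentric coordinates are available. The following sketch, with function names `barycentric` and `quadratic_interpolant` chosen here for illustration, evaluates the quadratic interpolant and checks it on a quadratic polynomial, which it should reproduce up to roundoff.

```python
def barycentric(v, x):
    """Barycentric coordinates of the 2D point x for the triangle v = (v1, v2, v3)."""
    def det(p, q, r):
        return (q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])
    d = det(v[0], v[1], v[2])
    return (det(x, v[1], v[2]) / d, det(v[0], x, v[2]) / d, det(v[0], v[1], x) / d)

def quadratic_interpolant(f, v, x):
    """Evaluate (4.3.4): vertex terms plus edge-midpoint terms."""
    lam = barycentric(v, x)
    total = sum(f(v[i]) * lam[i] * (2 * lam[i] - 1) for i in range(3))
    for i in range(3):
        for j in range(i + 1, 3):
            mid = ((v[i][0] + v[j][0]) / 2, (v[i][1] + v[j][1]) / 2)
            total += f(mid) * 4 * lam[i] * lam[j]
    return total

tri = ((0.0, 0.0), (2.0, 0.5), (0.5, 1.5))
q = lambda p: 1 + 2 * p[0] - p[1] + 3 * p[0] ** 2 - p[0] * p[1] + 0.5 * p[1] ** 2
x = (0.8, 0.6)
print(quadratic_interpolant(q, tri, x), q(x))
```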


Cubic interpolation uses the vertices $v_i$, points on the edges $v_{ij} = (2v_i + v_j)/3$ (so
that $v_{ij} \ne v_{ji}$) and $v_{123} = (v_1 + v_2 + v_3)/3$ (which is the centroid of the triangle).
The cubic interpolant can be expressed as

$$ p_3(x) = \frac12 \sum_{i=1}^{3} f(v_i)\, \lambda_i (3\lambda_i - 1)(3\lambda_i - 2) + \frac92 \sum_{i,j=1;\ i \ne j}^{3} f(v_{ij})\, \lambda_i \lambda_j (3\lambda_i - 1) + 27 f(v_{123})\, \lambda_1 \lambda_2 \lambda_3. \tag{4.3.5} $$

This can be continued to higher degree interpolation on a triangle.

In general, we can create a polynomial interpolant of degree $\le k$ on a triangle,
using interpolation points where each barycentric coordinate $\lambda_i$ is a multiple of $1/k$.
This gives $\binom{k+2}{2}$ interpolation points as desired. This scheme for interpolation over
triangles is called Lagrange interpolation. This can be extended to tetrahedra in three
dimensions, and to simplices in even higher dimensions.
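The point count can be confirmed by enumeration. The following sketch (the helper names `dim_poly` and `lattice_points` are assumptions for illustration) lists the Lagrange points in barycentric coordinates and compares their number with the dimension of the polynomial space.

```python
from math import comb

def dim_poly(k, d):
    """Dimension of the space of polynomials of degree <= k in d variables."""
    return comb(k + d, d)

def lattice_points(k):
    """Barycentric coordinates (i/k, j/k, l/k) with i + j + l = k, for k >= 1."""
    return [(i / k, j / k, (k - i - j) / k)
            for i in range(k + 1) for j in range(k + 1 - i)]

dims = [dim_poly(k, 2) for k in (1, 2, 3)]
counts = [len(lattice_points(k)) for k in (1, 2, 3)]
```

For degrees 1, 2 and 3 both lists contain 3, 6 and 10.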

Finding a set of interpolation points in two or more dimensions is a more complex
issue than for one dimension. In one dimension, as long as there are $k + 1$ distinct
points, there is one and only one interpolating polynomial of degree $\le k$. However, in
two or more dimensions, having $\dim \mathcal{P}_{k,d}$ distinct points does not guarantee existence
or uniqueness of an interpolating polynomial. Consider, for example, having all
the interpolating points on a circle in two dimensions: $(x - a)^2 + (y - b)^2 = c^2$.
There are clearly infinitely many points on this circle. Even though, for example,
$\dim \mathcal{P}_{2,2} = 6$, if we choose six distinct points on this circle, then uniqueness fails as
any multiple of $(x - a)^2 + (y - b)^2 - c^2$ can be added to a given interpolant and
we still have an interpolant. Also, existence fails as the space of all quadratic
polynomials of two variables restricted to this circle has dimension no more than
five. That is, if the values at five of the six interpolation points are known, then the
value at the sixth point must be a linear combination of the values at the first five.
Even if not all six points are exactly on a circle (or ellipse, or hyperbola), if they
are close to being so, then the interpolation problem becomes ill-conditioned: small
perturbations are amplified greatly.
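The failure on a circle is easy to observe numerically. The sketch below builds the interpolation matrix for the monomial basis $1, x, y, x^2, xy, y^2$ at six points; the helper name `quad_matrix` and the particular point sets are assumptions made for the example.

```python
import numpy as np

def quad_matrix(pts):
    """Rows are (1, x, y, x^2, xy, y^2) evaluated at each point."""
    x, y = pts[:, 0], pts[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])

# Six points on the unit circle (vertices of a regular hexagon)
theta = 2 * np.pi * np.arange(6) / 6
hexagon = np.column_stack([np.cos(theta), np.sin(theta)])
rank_circle = np.linalg.matrix_rank(quad_matrix(hexagon))

# Vertices and edge midpoints of the triangle (0,0), (1,0), (0,1)
lagrange = np.array([[0, 0], [1, 0], [0, 1], [0.5, 0], [0.5, 0.5], [0, 0.5]], dtype=float)
rank_lagrange = np.linalg.matrix_rank(quad_matrix(lagrange))

# Moving one hexagon vertex slightly off the circle restores invertibility,
# but the problem is then badly conditioned
perturbed = hexagon.copy()
perturbed[0] *= 1 + 1e-6
cond_perturbed = np.linalg.cond(quad_matrix(perturbed))
```

The hexagon matrix has rank 5 because $x^2 + y^2 - 1$ vanishes at all six points, while the vertex-and-midpoint set gives full rank 6.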

For determining if a set of interpolation points is suitable for a space of polynomials
$\mathcal{P}$, such as $\mathcal{P}_{k,d}$ of all polynomials of degree $\le k$ in $d$ variables, we can use
the following theorem.


Theorem 4.9 A set $\{x_1, x_2, \dots, x_N\}$ of distinct points is a set of acceptable
interpolation points for the vector space $\mathcal{P}$ of interpolation functions if and only if
$N = \dim \mathcal{P}$ and there is no non-zero $p \in \mathcal{P}$ where $p(x_i) = 0$ for all $i$.


Proof Let $\{\phi_1, \phi_2, \dots, \phi_N\}$ be a basis for $\mathcal{P}$. Consider the linear transformation
$T\colon \mathbb{R}^N \to \mathbb{R}^N$ given by $(Tc)_i = \sum_{j=1}^{N} c_j\,\phi_j(x_i)$. The set $\{x_1, x_2, \dots, x_N\}$ is a set
of acceptable interpolation points if and only if $T$ is invertible. Since the matrix of
$T$ with respect to any basis is $N \times N$, $T$ is invertible if and only if the only $c$ where
$Tc = 0$ is $c = 0$. That is, the only $p = \sum_{j=1}^{N} c_j \phi_j \in \mathcal{P}$ where $p(x_i) = 0$ for all $i$ is
$p = 0$, as we wanted.


The contrapositive of Theorem 4.9 indicates that if there is a non-zero $p \in \mathcal{P}$ that is zero at all
interpolation points, then the interpolation points are not acceptable. For example,
for quadratic interpolation over two dimensions, the vertices of a regular
hexagon do not form an acceptable set of interpolation points, as they all lie on a
circle. Fortunately, we can guarantee the existence of acceptable interpolation points.


Theorem 4.10 Suppose that $\mathcal{P}$ is a vector space of analytic functions¹ (which
includes all polynomials) with $0 < \dim \mathcal{P} < \infty$, then any $D$ with positive volume in
$\mathbb{R}^d$ has a set of acceptable interpolation points.


Proof Let $\{\phi_1, \phi_2, \dots, \phi_N\}$ be a basis for $\mathcal{P}$. Then $\{x_1, x_2, \dots, x_N\}$ is a set of
acceptable interpolation points if and only if the matrix $A(x_1, x_2, \dots, x_N) = [\phi_j(x_i)]_{i,j=1}^{N}$
is invertible, or equivalently, $\det A(x_1, x_2, \dots, x_N) \ne 0$. Since the $\phi_j$ are analytic
functions, $\det A(x_1, x_2, \dots, x_N)$ is an analytic function of $(x_1, x_2, \dots, x_N)$. The
zero set of an analytic function [178] has zero volume. Therefore, as $D^N$ has
positive volume $\mathrm{vol}_{Nd}(D^N) = \mathrm{vol}_d(D)^N > 0$, there must be points in $D^N$ where
$\det A(x_1, x_2, \dots, x_N) \ne 0$ and so $\{x_1, x_2, \dots, x_N\}$ is a set of acceptable
interpolation points. Furthermore, if $\det A(x_1^*, x_2^*, \dots, x_N^*) = 0$ there must be points
$x_1, x_2, \dots, x_N$ with $x_i$ arbitrarily close to $x_i^*$ for which $\det A(x_1, x_2, \dots, x_N) \ne 0$
and so $\{x_1, x_2, \dots, x_N\}$ is a set of acceptable interpolation points.

¹ A function $f$ is analytic on $D$ if for every $a \in D$ the Taylor series of $f$ using derivatives at $a$
converges for all $x$ within a positive distance of $a$, and $f$ is equal to its Taylor series where the
Taylor series converges.

Fig. 4.3.2 Cubic Hermite interpolation on a triangle


It should be noted that simply having existence and uniqueness of interpolants does 
not mean that a given set of interpolation points is a good set of such points. The 
quality of a set of interpolation points should be measured in terms of quantities such 
as the Lebesgue number (see Section 4.1.2) for an interpolation scheme. 


4.3.1.1 Hermite Triangle Element 


We can create a kind of cubic Hermite interpolant on a triangle, interpolating first
derivatives as well as values. This is illustrated in Figure 4.3.2. Note that “•” indicates
that the value is interpolated at this point, while “⊙” indicates that the value and
the gradient are interpolated at this point. The point in the interior of the triangle at
which the value is interpolated is the centroid of the triangle.

For general Hermite interpolants we do not restrict the order of the derivatives
used, although this is usually much less than the degree of the polynomial interpolants.
We do assume that there is an interpolation operator $\mathcal{I}\colon C^\ell(K) \to C^\ell(K)$
where $C^\ell(K)$ is the set of all functions $f\colon K \to \mathbb{R}$ where all derivatives of order
$\le \ell$ are continuous. For Hermite interpolation as described above (Figure 4.3.2), we
take $\ell = 1$, while Lagrange interpolation schemes (Figure 4.3.1) have $\ell = 0$.

The Hermite cubic triangle interpolation system uses the values and gradients
at the vertices together with the value at the centroid to uniquely specify the cubic
interpolant. Each vertex contributes three interpolation conditions (the value plus
two derivatives), so the centroid value provides the additional condition needed as
$\dim \mathcal{P}_{3,2} = 10$. The basis functions can be represented in terms of barycentric
coordinates. For example, the basis function that is one at the centroid, with zero
value and zero gradient at the vertices, is given by $27\,\lambda_1\lambda_2\lambda_3$ in terms of barycentric
coordinates.

Fig. 4.3.3 Affine transform from reference triangle $\hat{K}$ to triangle $K$


4.3.1.2 Reference Triangles 


A way to prove properties about interpolation methods over triangles is to start with
a reference triangle or reference element $\hat{K}$, and then transfer properties from the
reference triangle $\hat{K}$ to a given triangle (or other shape) via an affine function or
transformation: $T_K(\hat{x}) = A_K \hat{x} + b_K$ with $A_K$ an invertible matrix. For example,
we can take the reference element $\hat{K}$ to be the triangle with vertices $\hat{v}_1 = (0, 0)$,
$\hat{v}_2 = (1, 0)$, and $\hat{v}_3 = (0, 1)$. For a triangle $K$ with vertices $v_1, v_2, v_3$, we can set
$b_K = v_1$ and $A_K = [v_2 - v_1, v_3 - v_1]$. This is illustrated in Figure 4.3.3.


Any function $f\colon K \to \mathbb{R}$ can be represented by a function $\hat{f}\colon \hat{K} \to \mathbb{R}$ given by

(4.3.6)  $\hat{f}(\hat{x}) = f(A_K \hat{x} + b_K).$

It is easy to see that $\hat{f}$ is a polynomial of degree $k$ if and only if $f$ is a polynomial
of degree $k$. Note that if $\hat{x} = \lambda_1 \hat{v}_1 + \lambda_2 \hat{v}_2 + \lambda_3 \hat{v}_3$ is a representation of $\hat{x} \in \hat{K}$ by
barycentric coordinates, and $x = T_K(\hat{x})$, then the barycentric coordinates of $x$ in $K$
are also $(\lambda_1, \lambda_2, \lambda_3)$. So if $f$ is the linear interpolant of data values at the vertices of
$K$ ($f(v_i) = y_i$), then $\hat{f}$ is the linear interpolant of those same values at the vertices
of $\hat{K}$ ($\hat{f}(\hat{v}_i) = y_i$). The Lagrange interpolation points are, in fact, constant in terms
of the barycentric coordinates. The vertices have barycentric coordinates

$(1, 0, 0),\ (0, 1, 0),\ (0, 0, 1).$


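The barycentric coordinates of the midpoints for quadratic Lagrange interpolation
are

$(\tfrac12, \tfrac12, 0),\ (\tfrac12, 0, \tfrac12),\ (0, \tfrac12, \tfrac12).$

The barycentric coordinates for degree $d$ Lagrange interpolation are

$\left(\frac{i}{d}, \frac{j}{d}, \frac{l}{d}\right)$ for non-negative integers $i, j, l$ with $i + j + l = d$.

The centroid of any triangle has barycentric coordinates $(\tfrac13, \tfrac13, \tfrac13)$.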


For the given reference triangle, the barycentric coordinates can be expressed
explicitly in terms of $\hat{x} = (\hat{x}, \hat{y})$:

$\lambda_1 = 1 - \hat{x} - \hat{y}, \qquad \lambda_2 = \hat{x}, \qquad \lambda_3 = \hat{y}.$


While function values on the original and reference elements can be related
through (4.3.6), gradients and Hessian matrices are a little more complex:

(4.3.7)  $A_K^T\, \nabla f(x) = \nabla \hat{f}(\hat{x}),$
(4.3.8)  $A_K^T\, \mathrm{Hess} f(x)\, A_K = \mathrm{Hess} \hat{f}(\hat{x}),$

where $x = A_K \hat{x} + b_K = T_K(\hat{x})$. Integrals are also transformed:

(4.3.9)  $\int_K f(x)\,dx = |\det A_K| \int_{\hat{K}} \hat{f}(\hat{x})\,d\hat{x}.$
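The transformation rules for gradients and integrals can be checked numerically. In the sketch below the triangle, the test functions and the helper `num_grad` (a central-difference gradient) are illustrative assumptions; the integral over the reference triangle uses the edge-midpoint quadrature rule, which is exact for polynomials of degree at most two.

```python
import numpy as np

V = np.array([[1.0, 1.0], [3.0, 1.5], [1.5, 3.0]])    # hypothetical vertices v1, v2, v3 of K
b_K = V[0]
A_K = np.column_stack([V[1] - V[0], V[2] - V[0]])      # A_K = [v2 - v1, v3 - v1]
T_K = lambda xh: A_K @ xh + b_K

f = lambda x: x[0] ** 2 * x[1] + 3 * x[1]
grad_f = lambda x: np.array([2 * x[0] * x[1], x[0] ** 2 + 3])
f_hat = lambda xh: f(T_K(xh))                          # f pulled back to the reference triangle

def num_grad(g, z, eps=1e-6):
    """Central-difference approximation of the gradient of g at z."""
    return np.array([(g(z + eps * e) - g(z - eps * e)) / (2 * eps) for e in np.eye(len(z))])

xh = np.array([0.3, 0.4])
lhs = A_K.T @ grad_f(T_K(xh))     # A_K^T grad f(x)
rhs = num_grad(f_hat, xh)         # grad f_hat(x_hat)

# Integral of a linear function g over K via the reference triangle, as in (4.3.9)
g = lambda x: 2.0 + x[0] - 3.0 * x[1]
mid_hat = np.array([[0.5, 0.0], [0.5, 0.5], [0.0, 0.5]])
integral_ref = abs(np.linalg.det(A_K)) * sum(g(T_K(m)) for m in mid_hat) / 6.0
area = 0.5 * abs(np.linalg.det(A_K))
integral_exact = area * g(V.mean(axis=0))   # for linear g: area times value at the centroid
```

The pre-computation remark in the text corresponds to storing `mid_hat` and the weights once and reusing them for every triangle.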


In computational practice, many quantities can be pre-computed on the reference 
triangle, and the corresponding quantity on the original triangle can be computed 
quickly. 

Some important quantities, such as Lebesgue numbers for polynomial interpolation,
are invariant under affine transformations (see Section 4.1.2). Interpolation
schemes on a reference triangle $\hat{K}$ are transformed to interpolation schemes on a
given triangle $K$ by an affine transformation $T_K$; if $\hat{\mathcal{P}}$ is the family of interpolation
functions on $\hat{K}$ then the interpolation functions $\mathcal{P}$ on $K$ are given by $p = \hat{p} \circ T_K^{-1}$
for $\hat{p} \in \hat{\mathcal{P}}$. Since polynomials under affine transformations are still polynomials of
the same degree, if $\hat{\mathcal{P}} = \mathcal{P}_{k,d}$ then $\mathcal{P} = \mathcal{P}_{k,d}$ as well. The interpolation points $\hat{x}_i \in \hat{K}$
are transformed to interpolation points $x_i = T_K(\hat{x}_i) \in K$.
Table 4.3.1 Lebesgue numbers of Lagrange interpolation of degree k in a triangle

  k                  1      2      3      4      5      6      7
  $\Lambda_{k,2}$    1      1.67   2.26   3.47   5.23   8.48   14.33


Theorem 4.11 If $\hat{\Lambda}$ is the Lebesgue number for interpolation on $\hat{K}$ with interpolation
points $\hat{x}_i$, $i = 1, 2, \dots, N$, using interpolation functions from $\hat{\mathcal{P}}$, and $\Lambda$ is
the Lebesgue number of interpolation on $K$ with interpolation points $x_i = T_K(\hat{x}_i)$
using interpolation functions $\mathcal{P} = \hat{\mathcal{P}} \circ T_K^{-1}$, then $\Lambda = \hat{\Lambda}$.


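Proof The Lagrange interpolation functions on $\hat{K}$ are given by $\hat{L}_i \in \hat{\mathcal{P}}$ where
$\hat{L}_i(\hat{x}_j) = 1$ if $i = j$ and zero otherwise. On the other hand $L_i = \hat{L}_i \circ T_K^{-1} \in \mathcal{P}$
and $L_i(x_j) = \hat{L}_i(T_K^{-1}(T_K(\hat{x}_j))) = \hat{L}_i(\hat{x}_j) = 1$ if $i = j$ and zero if
$i \ne j$. Thus $L_i = \hat{L}_i \circ T_K^{-1}$ are the Lagrange interpolation functions on $K$. Therefore,

$\Lambda = \max_{x \in K} \sum_{i=1}^{N} |L_i(x)| = \max_{\hat{x} \in \hat{K}} \sum_{i=1}^{N} |L_i(T_K(\hat{x}))| = \max_{\hat{x} \in \hat{K}} \sum_{i=1}^{N} |\hat{L}_i(\hat{x})| = \hat{\Lambda},$

as we wanted.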
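A Lebesgue number can be estimated by sampling the Lebesgue function on a fine grid. The sketch below does this for quadratic Lagrange interpolation, with the basis functions written in barycentric coordinates so that, as in Theorem 4.11, the result does not depend on the particular triangle. The function names and the grid size `n = 60` are assumptions for the example; a sampled maximum is in general only a lower bound on the true maximum.

```python
import numpy as np

def p2_basis(lam):
    """Quadratic Lagrange basis functions in barycentric coordinates."""
    l1, l2, l3 = lam
    return np.array([l1 * (2 * l1 - 1), l2 * (2 * l2 - 1), l3 * (2 * l3 - 1),
                     4 * l1 * l2, 4 * l1 * l3, 4 * l2 * l3])

def lebesgue_number(basis, n=60):
    """Maximum of sum_i |L_i| over a barycentric grid with spacing 1/n."""
    best = 0.0
    for i in range(n + 1):
        for j in range(n + 1 - i):
            lam = (i / n, j / n, (n - i - j) / n)
            best = max(best, np.abs(basis(lam)).sum())
    return best

Lambda_2 = lebesgue_number(p2_basis)
```

With `n` divisible by three the grid contains the centroid, where the sum equals 5/3; this agrees with the rounded value 1.67 for $k = 2$.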


Lebesgue numbers for Lagrange interpolation on a triangle for different degrees $k$ are
given in [26] and shown in Table 4.3.1. Note that the Lebesgue number $\Lambda_{k,2}$ increases
with $k$, but not so dramatically that seventh-order interpolation is not useful.

Just as in one dimension, equally spaced interpolation is not optimal when the 
degree is high. Bos [26] gives alternative interpolation points for triangles that have 
better Lebesgue numbers. 


4.3.1.3 /\Error Estimates 
As in the one-dimensional case, the error estimates will depend on the size of the
triangles. We can measure the size of each triangle or other shape by means of the
diameter

(4.3.10)  $h_K = \mathrm{diam}(K) = \max_{x, y \in K} \|x - y\|_2.$


In two and higher dimensions, however, this is not the only quantity that is important. 
Triangles can be long and thin, or relatively “chunky”. For convex K, one way to 
measure this is to look at the radius of the largest ball that can fit in K: 


(4.3.11)  $\rho_K = \sup\{\rho \mid \text{there is a ball } B(x, \rho) \subset K \text{ for some } x \in K\}.$

Brenner and Scott [31, p. 99] define a chunkiness parameter $\gamma_K := h_K/\rho_K \ge 1$.
Note that smaller $\gamma_K$ means “chunkier” $K$. Thin but long triangles have $\rho_K$ small
in comparison to $h_K$ making $\gamma_K$ large. Another measure of “chunkiness” is
$\tilde{\gamma}_K := h_K / \mathrm{vol}_d(K)^{1/d}$ for $K \subset \mathbb{R}^d$. Here “$\mathrm{vol}_d(R)$” is the $d$-dimensional volume of a region
$R$; for $d = 2$ this is just the area of $R$, and for $d = 1$ it is just the total length of $R$.
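For a triangle these quantities are simple to compute: the diameter is the longest edge and the largest inscribed ball is the incircle, whose radius is the area divided by the semi-perimeter. The sketch below compares a “chunky” and a thin triangle; the function names and the two example triangles are assumptions for illustration.

```python
import numpy as np
from itertools import combinations

def diameter(V):
    """h_K: the longest distance between two vertices (rows of V)."""
    return max(np.linalg.norm(p - q) for p, q in combinations(V, 2))

def inradius(V):
    """rho_K for a triangle: area divided by the semi-perimeter."""
    a, b, c = (np.linalg.norm(V[i] - V[j]) for i, j in ((0, 1), (1, 2), (2, 0)))
    s = 0.5 * (a + b + c)
    area = 0.5 * abs((V[1, 0] - V[0, 0]) * (V[2, 1] - V[0, 1])
                     - (V[2, 0] - V[0, 0]) * (V[1, 1] - V[0, 1]))
    return area / s

def chunkiness(V):
    """gamma_K = h_K / rho_K; smaller values mean a better-shaped triangle."""
    return diameter(V) / inradius(V)

equilateral = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
sliver = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.01]])
gamma_eq = chunkiness(equilateral)
gamma_sliver = chunkiness(sliver)
```

The equilateral triangle gives $\gamma_K = 2\sqrt{3}$, while the sliver gives a value in the hundreds.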

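For measuring the size and smoothness of functions on a region $\Omega \subset \mathbb{R}^d$ we use
$W^{m,p}(\Omega)$, or Sobolev, norms and semi-norms; $m$ indicates the degree of differentiability,
and $1 \le p \le \infty$ the exponent used: in terms of multi-indexes,

(4.3.12)  $\|f\|_{W^{m,p}(\Omega)} = \left[ \sum_{\alpha : |\alpha| \le m} \int_\Omega \left| \frac{\partial^{|\alpha|} f}{\partial x^\alpha}(x) \right|^p dx \right]^{1/p},$

(4.3.13)  $|f|_{W^{m,p}(\Omega)} = \left[ \sum_{\alpha : |\alpha| = m} \int_\Omega \left| \frac{\partial^{|\alpha|} f}{\partial x^\alpha}(x) \right|^p dx \right]^{1/p}.$

Note that

$\|f\|_{W^{m,p}(\Omega)} = \left[ \sum_{k=0}^{m} |f|_{W^{k,p}(\Omega)}^p \right]^{1/p}.$

If $p = \infty$ we use

(4.3.14)  $\|f\|_{W^{m,\infty}(\Omega)} = \max_{\alpha : |\alpha| \le m}\ \max_{x \in \Omega} \left| \frac{\partial^{|\alpha|} f}{\partial x^\alpha}(x) \right|,$

(4.3.15)  $|f|_{W^{m,\infty}(\Omega)} = \max_{\alpha : |\alpha| = m}\ \max_{x \in \Omega} \left| \frac{\partial^{|\alpha|} f}{\partial x^\alpha}(x) \right|.$

It should be noted that the $W^{m,p}(\Omega)$ semi-norm is equivalent to the semi-norm

$f \mapsto \left[ \int_\Omega \| D^m f(x) \|^p \, dx \right]^{1/p}$

in the sense of (1.5.2).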

We want to bound the interpolation error $f - \mathcal{I} f$. This error is zero if $f$ is itself
an interpolant, that is, $\mathcal{I} f = f$. Each interpolation scheme has a set of possible
interpolants $\mathcal{P} = \mathrm{range}(\mathcal{I})$. Linear Lagrange interpolation is exact for all linear functions,
quadratic Lagrange interpolation is exact for all quadratic functions, while Hermite
interpolation is exact for all cubic functions. We suppose that this interpolation operator
is exact for all polynomials of degree $\le k$. As $k$ becomes larger, the interpolation
operator is generally more accurate. As noted with respect to the Runge phenomenon
(Section 4.1.1.13), it is more important to reduce spacing ($h_K$) than it is to increase
the degree ($k$). We assume that $\mathcal{P}_k \subseteq \mathcal{P}$ where $\mathcal{P}_k$ is the set of polynomials of $x$ of
degree $\le k$.
To obtain these estimates, we start with estimates of the interpolation error $\hat{f} - \hat{\mathcal{I}} \hat{f}$
on the reference triangle $\hat{K}$. By pulling the interpolation data from the original triangle
$K$ back to the reference triangle $\hat{K}$, doing the interpolation on $\hat{K}$, and pushing the
resulting interpolant back to $K$, we can get bounds for interpolation error $f - \mathcal{I} f$ on
any given triangle $K$. We need to consider $f$ for which we can do the interpolation.
To guarantee that we can take the derivatives up to order $\ell$ at an interpolation point,
we need $\|f\|_{W^{m,p}(K)}$ finite where $m - \ell > d/p$ (see [213, Sec. 6.4.6], for example),
or $m \ge \ell$ if $p = \infty$.

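Because $\hat{\mathcal{I}}$ is exact on $\mathcal{P}_k$, we can use the Bramble-Hilbert lemma [31,
Thm. 4.3.8, p. 102] to give a bound

$|\hat{f} - \hat{\mathcal{I}} \hat{f}|_{W^{s,p}(\hat{K})} \le C(k, m, p, \hat{K}) \sum_{j=k+1}^{m} |\hat{f}|_{W^{j,p}(\hat{K})}$ provided $m \ge \max(s, k + 1)$.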


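Now we use the affine scaling property to transfer this result to a triangle (or tetrahedron
or simplex) $K \subset \mathbb{R}^d$. Suppose that $T_K\colon \hat{K} \to K$ is the affine transformation
$T_K(\hat{x}) = A_K \hat{x} + b_K$. If $x = T_K(\hat{x})$ then $\hat{x} = T_K^{-1}(x) = A_K^{-1}(x - b_K) = A_K^{-1} x - A_K^{-1} b_K$.
Given a function $f\colon K \to \mathbb{R}$ we have a corresponding function $\hat{f}\colon \hat{K} \to \mathbb{R}$
where $\hat{f}(\hat{x}) = f(x)$ for $x = T_K(\hat{x})$. That is, $\hat{f}(\hat{x}) = f(T_K(\hat{x})) = f(A_K \hat{x} + b_K)$.
Then the Sobolev semi-norms are

$|f|_{W^{j,p}(K)} = \left[ \sum_{\alpha : |\alpha| = j} \int_K \left| \frac{\partial^{|\alpha|} f}{\partial x^\alpha}(x) \right|^p dx \right]^{1/p}.$

Note that if $g(x) = \hat{g}(\hat{x})$, then

$\frac{\partial g}{\partial x_i}(x) = \frac{\partial}{\partial x_i}\, \hat{g}(A_K^{-1} x - A_K^{-1} b_K) = \sum_{\nu=1}^{d} (A_K^{-1})_{\nu i}\, \frac{\partial \hat{g}}{\partial \hat{x}_\nu}(\hat{x}),$

so there is a constant $C_1(|\alpha|, j, d)$ where

$\left| \frac{\partial^{|\alpha|} f}{\partial x^\alpha}(x) \right| \le C_1(|\alpha|, j, d)\, \|A_K^{-1}\|^{|\alpha|} \max_{\beta : |\beta| = |\alpha|} \left| \frac{\partial^{|\beta|} \hat{f}}{\partial \hat{x}^\beta}(\hat{x}) \right|.$

This gives the bound on the Sobolev semi-norm for $j \le k$,

(4.3.16)  $|f|_{W^{j,p}(K)} \le C_2(j, d, p)\, \|A_K^{-1}\|^{j}\, |\det A_K|^{1/p}\, |\hat{f}|_{W^{j,p}(\hat{K})}.$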


Applying the argument to $T_K^{-1}\colon K \to \hat{K}$ gives

(4.3.17)  $|\hat{f}|_{W^{j,p}(\hat{K})} \le C_2(j, d, p)\, \|A_K\|^{j}\, |\det A_K^{-1}|^{1/p}\, |f|_{W^{j,p}(K)}.$


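The interpolation estimate for $\hat{K}$ from the Bramble-Hilbert lemma implies that for
$r \le k$,

$|\hat{f} - \hat{\mathcal{I}} \hat{f}|_{W^{r,p}(\hat{K})} \le C_3(r, m, p) \sum_{j=k+1}^{m} |\hat{f}|_{W^{j,p}(\hat{K})}.$

We can transform these bounds on $\hat{K}$ to $K$. Note that the interpolation operator $\mathcal{I}_K$
on $K$ is given by $\mathcal{I}_K f(x) = \hat{\mathcal{I}} \hat{f}(\hat{x})$ where $x = T_K(\hat{x})$. So

$$\begin{aligned}
|f - \mathcal{I}_K f|_{W^{r,p}(K)} &\le C_2(r, d, p)\, \|A_K^{-1}\|^{r}\, |\det A_K|^{1/p}\, |\hat{f} - \hat{\mathcal{I}} \hat{f}|_{W^{r,p}(\hat{K})} \\
&\le C_2(r, d, p)\, \|A_K^{-1}\|^{r}\, |\det A_K|^{1/p}\, C_3(r, m, p) \sum_{j=k+1}^{m} |\hat{f}|_{W^{j,p}(\hat{K})} \\
&\le C_2(r, d, p)\, \|A_K^{-1}\|^{r}\, |\det A_K|^{1/p}\, C_3(r, m, p) \sum_{j=k+1}^{m} C_2(j, d, p)\, \|A_K\|^{j}\, |\det A_K^{-1}|^{1/p}\, |f|_{W^{j,p}(K)} \\
&\le C_4(r, k, m, p, d) \sum_{j=k+1}^{m} \|A_K^{-1}\|^{r}\, \|A_K\|^{j}\, |f|_{W^{j,p}(K)} \\
&= C_4(r, k, m, p, d)\, \kappa(A_K)^{r} \sum_{j=k+1}^{m} \|A_K\|^{j-r}\, |f|_{W^{j,p}(K)}.
\end{aligned}$$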
jok+l 


Here $\kappa(A) = \|A\|\,\|A^{-1}\|$ is the condition number of $A$.

Let $h_K$ be the diameter of $K$, which is the length of the longest edge of the triangle
$K$. In two dimensions, $A_K = [v_2 - v_1, v_3 - v_1]$, where $v_1, v_2, v_3$ are the vertices
of $K$, so

$\|A_K\|_2 \le \|A_K\|_F = \sqrt{\|v_2 - v_1\|_2^2 + \|v_3 - v_1\|_2^2} \le \sqrt{2}\, h_K.$

For general $d$, $\|A_K\|_2 \le \sqrt{d}\, h_K$. However, the condition number $\kappa_2(A_K)$ can be
arbitrarily large. To obtain a bound on the condition number, we need an additional
condition that the triangle $K$ is “well shaped”, at least according to the “chunkiness”
parameters $\gamma_K$ or $\tilde{\gamma}_K$ mentioned above. In two dimensions, “chunkiness” can be
defined in terms of the minimum angle of the triangle $K$.
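Both statements can be observed on small examples. The sketch below (with assumed helper names and two hypothetical triangles) checks the norm bound and compares the 2-norm condition numbers of a well-shaped and a thin triangle.

```python
import numpy as np

def affine_matrix(V):
    """A_K = [v2 - v1, v3 - v1] for a triangle with vertex rows V."""
    return np.column_stack([V[1] - V[0], V[2] - V[0]])

def diameter(V):
    """h_K: the length of the longest edge."""
    return max(np.linalg.norm(V[i] - V[j]) for i in range(3) for j in range(i))

equilateral = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
sliver = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.01]])

norm_ok = all(np.linalg.norm(affine_matrix(V), 2) <= np.sqrt(2) * diameter(V) + 1e-12
              for V in (equilateral, sliver))
kappa_eq = np.linalg.cond(affine_matrix(equilateral))       # 2-norm condition number
kappa_sliver = np.linalg.cond(affine_matrix(sliver))
```

The equilateral triangle gives $\kappa_2(A_K) = \sqrt{3}$, while flattening the triangle makes the condition number grow without bound even though $h_K$ stays fixed.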

In higher dimensions, “chunkiness” is perhaps most commonly defined in terms
of having an upper bound on the ratio $h_K/\rho_K$ where $\rho_K$ is the radius of the largest
inscribed sphere of $K$.

However “chunkiness” or “well-shaped” is defined, it ensures a bound $\kappa(A_K) \le \kappa_{\max}$.
Then we have a bound

(4.3.18)  $|f - \mathcal{I}_K f|_{W^{r,p}(K)} \le C_5(r, k, m, p, d)\, \kappa_{\max}^{r} \sum_{j=k+1}^{m} h_K^{j-r}\, |f|_{W^{j,p}(K)}.$
Clearly then, for smooth $f$, the interpolation error in the Sobolev semi-norm of $W^{r,p}(K)$
is

$|f - \mathcal{I}_K f|_{W^{r,p}(K)} = O(h_K^{k+1-r})$ as $h_K \to 0$

for well-shaped triangles. Note, however, that in the case of $r = 0$, the factor
$\kappa(A_K)^r = 1$ no matter the value of $\kappa(A_K)$. So, if we are concerned only with approximating
function values, then the requirement of the triangle to be well shaped is
unnecessary. But if we wish the derivatives of the interpolant $\mathcal{I}_K f$ to approximate
derivatives of $f$, then the condition of being well shaped becomes necessary. As
we will see in Section 6.3.2.2 on the finite element method for partial differential
equations, approximating derivatives is exactly what is needed.


4.3.2 Interpolation over Triangulations 


Triangulations give a way of dividing a region up into triangles in two dimensions, 
or tetrahedra in three dimensions, or simplices in higher dimensions. If we can 
interpolate using polynomials over each triangle, then we can create a piecewise 
polynomial interpolant over the entire region. However, we want the interpolant on 
each triangle to be consistent with the interpolant on the neighboring triangles so that 
the combined interpolant is at least continuous over the triangulated region (Figure 
4.3.4). 

A triangulation is not simply a union of non-overlapping triangles. The triangles
must meet each other in a specific way: if $T_1$ and $T_2$ are two triangles then $T_1 \cap T_2$ is
either


• empty,
• a common vertex of $T_1$ and $T_2$, or
• a common edge of $T_1$ and $T_2$.


Note that the common edge must be an entire edge, not a partial edge, as shown in 
Figure 4.3.5. 
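The “entire edge” requirement can be tested by looking for hanging nodes, that is, vertices lying strictly inside an edge of some triangle. The sketch below is only a partial check under the assumption that the triangles do not otherwise overlap; the function name, the tolerance and the two small meshes are illustrative.

```python
import numpy as np

def has_hanging_node(points, triangles, tol=1e-12):
    """Return True if some vertex lies strictly inside an edge of a triangle."""
    P = np.asarray(points, dtype=float)
    for tri in triangles:
        for a, b in ((tri[0], tri[1]), (tri[1], tri[2]), (tri[2], tri[0])):
            e = P[b] - P[a]
            for k in range(len(P)):
                if k in (a, b):
                    continue
                w = P[k] - P[a]
                cross = e[0] * w[1] - e[1] * w[0]      # zero when P[k] is on the line through the edge
                t = (w @ e) / (e @ e)                  # position along the edge, 0 at a and 1 at b
                if abs(cross) <= tol * (e @ e) and tol < t < 1 - tol:
                    return True
    return False

# Unit square split along a diagonal: allowed contact
square_pts = [(0, 0), (1, 0), (1, 1), (0, 1)]
square_tris = [(0, 1, 2), (0, 2, 3)]

# One triangle above the x-axis, two below sharing the vertex (1, 0),
# which sits in the middle of the upper triangle's bottom edge: disallowed contact
bad_pts = [(0, 0), (2, 0), (1, 1), (1, 0), (1, -1)]
bad_tris = [(0, 1, 2), (0, 4, 3), (3, 4, 1)]

ok = has_hanging_node(square_pts, square_tris)
bad = has_hanging_node(bad_pts, bad_tris)
```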

Simply interpolating in each triangle does not guarantee that the interpolant is
continuous on each common edge. We want the interpolant on each side of a common
edge to be the same so that the overall interpolant is continuous.

Consider piecewise linear interpolation; if the two interpolated values at the ends of
a common edge are the same for both triangles, then the interpolants on each triangle
sharing the edge will match on that edge. For a pair of triangles that meet at a vertex,
the values of the interpolants on the different triangles must also match.

Since the values at interpolation points can be treated as independent quantities, 
these matching conditions imply that each vertex must be an interpolation point, and 
each edge must have two interpolation points. Piecewise linear interpolation on a 
triangulation then requires three interpolation points on each triangle, which must 
therefore be the vertices of each triangle. 
Fig. 4.3.4 An example of a triangulation

Fig. 4.3.5 Allowed and disallowed contact between triangles in a triangulation: (A) allowed contact; (B) disallowed contact


Guaranteeing continuity across triangles is substantially harder in two or more 
dimensions than one dimension, because in one dimension the “join” between two 
pieces is a point. As long as that point is an interpolation point, continuity is guar- 
anteed across the pieces. However, two triangles typically meet at an edge. Two 
tetrahedra typically meet at a face. The values of the interpolants from each triangle 
must match, not just at the interpolation points, but also at every point of the common 
edge. For tetrahedra, the interpolants must match at every point of the common face. 

To ensure that this works on triangles we require the following principles:


(1) vertices must be interpolation points;
(2) the interpolation values on an edge uniquely determine the interpolant on that
edge.


Fig. 4.3.6 Interpolation on a triangulation: (A) quadratic interpolation on a triangulation; (B) Hermite interpolation on a triangulation


Note that the second item implies that the value of the interpolant on a triangle restricted
to an edge must not depend on the interpolation values at any point not on that edge.
For tetrahedra and higher dimensional simplices, the condition can be expressed 
more succinctly: for an m-dimensional simplex, the interpolation values on each 
k-dimensional face must uniquely specify the interpolant on that face. This implies 
that the interpolant on a face cannot depend on any interpolation value for any 
interpolation point not on that face. 

Piecewise quadratic interpolation can also be guaranteed to be continuous using 
the Lagrange nodal points shown in Figure 4.3.1 (middle figure). For triangles with 
a common vertex, we just need vertices to be interpolation points. For triangles with 
a common edge, the interpolants for each triangle must be quadratic on the edge, and 
so we need three interpolation points on each edge to ensure that these quadratics 
match. This requires six interpolation points: the three vertices plus one point on 
each edge. The edge midpoints are usually chosen. This means that the orientations 
of the triangles sharing an edge do not matter. 

This gives continuous piecewise quadratic interpolation, but the first derivatives 
are not continuous. More specifically, while the tangential derivatives along the edges 
match because the values on the shared edge match, the normal derivatives generally 
do not. 
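The continuity of the piecewise quadratic interpolant can be observed on two triangles that share an edge. In the sketch below the triangles `K1` and `K2`, the test function and the helper names are assumptions for illustration; both interpolants are evaluated at sample points along the common edge.

```python
import numpy as np

def barycentric(x, V):
    """Barycentric coordinates of x in the triangle with vertex rows V."""
    M = np.vstack([np.ones(3), V.T])
    return np.linalg.solve(M, np.concatenate([[1.0], x]))

def p2_interpolant(f, V, x):
    """Quadratic Lagrange interpolant (4.3.4) using vertices and edge midpoints."""
    lam = barycentric(x, V)
    p = sum(f(V[i]) * lam[i] * (2 * lam[i] - 1) for i in range(3))
    for i in range(3):
        for j in range(i + 1, 3):
            p += 4 * f(0.5 * (V[i] + V[j])) * lam[i] * lam[j]
    return p

f = lambda q: np.exp(q[0]) * np.sin(2 * q[1])          # smooth, non-polynomial test function
K1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
K2 = np.array([[1.0, 0.0], [1.0, 1.2], [0.0, 1.0]])    # shares the edge from (1,0) to (0,1)

ts = np.linspace(0.0, 1.0, 11)
edge_pts = [(1 - t) * K1[1] + t * K1[2] for t in ts]
jump = max(abs(p2_interpolant(f, K1, x) - p2_interpolant(f, K2, x)) for x in edge_pts)
```

On the shared edge both interpolants reduce to the same one-variable quadratic through the two endpoint values and the midpoint value, so `jump` is at rounding level even though neither interpolant equals `f`.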

Hermite interpolation across a triangulation is illustrated by Figure 4.3.6. 


4.3.2.1 Creating Interpolants with Continuous First Derivatives 
Creating interpolation systems that guarantee continuous first derivatives across
triangles in a triangulation is a surprisingly tricky thing to do. Unlike in one dimension,
increasing the degree of the polynomial inside the triangle also increases the number
of conditions needed to match first derivatives across the boundary. The simplest is
the Argyris element illustrated in Figure 4.3.7.

Fig. 4.3.7 Argyris element (degree 5)

In Figure 4.3.7, the dot means interpolation of the value at that point, the small
circle around a point means interpolating the first derivatives at that point, and the
larger circle means interpolating the second derivatives at that point. The short
perpendicular line segments indicate interpolation of the normal derivative to the edge
at the intersection point of the edge and the line segment. The interpolating polynomials
have degree 5. The dimension of the space of degree 5 polynomials of two
variables is $\binom{5+2}{2} = 21$. This exactly matches the number of independent values
to be interpolated. At each vertex we interpolate the function value, two first
derivatives ($\partial f/\partial x$, $\partial f/\partial y$), and three second derivatives ($\partial^2 f/\partial x^2$, $\partial^2 f/\partial x\,\partial y$,
$\partial^2 f/\partial y^2$), giving six values interpolated for each vertex. This gives 18 interpolated
values for the vertices; the three normal derivative values at the midpoints
of the edges bring the total to 21 values to interpolate.

To see that Argyris element interpolants are continuous across edges, we note that 
on an edge the interpolating polynomial must have matching values, first and second 
derivatives, at the ends of the edge. The derivatives are, of course, scalar derivatives 
as along the edge we should consider tangential derivatives. This gives six values 
interpolated by a degree 5 polynomial in one variable. These six values are sufficient 
to uniquely specify the degree 5 polynomial on the edge. Since these six interpolated 
values are the same on both sides of the edge in question, the values of the interpolant 
must match across a common edge of two Argyris triangles. 

But to have continuous first derivatives across the boundary, we also need the
normal derivatives to match on each side of the edge. Each edge is straight, so if
$p(x)$ is a degree 5 polynomial in two variables, on each edge $\partial p/\partial n(x) = n^T \nabla p(x)$
is a polynomial of degree $5 - 1 = 4$, since $n$ is constant on each edge. The normal
derivative $\partial p/\partial n$ is interpolated at each end of an edge, as is the first tangential
derivative of $\partial p/\partial n$ at each end. Furthermore, since the Argyris element interpolates
the normal derivative at the midpoint of each edge, we have five values to interpolate
on the edge for $\partial p/\partial n$. This means that $\partial p/\partial n$ is a uniquely specified polynomial
of degree 4. Tangential derivatives of $p(x)$ along the edge are uniquely specified as
the values of $p(x)$ on the edge are uniquely specified by the values and derivatives
interpolated on that edge. Thus, the gradient $\nabla p(x)$ is uniquely specified on an edge
by the values and derivatives interpolated on that edge.

Fig. 4.3.8 Hsieh-Clough-Tocher (HCT) element: (A) HCT element (piecewise cubic); (B) HCT element piece
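The two uniqueness claims on an edge are one-variable statements, so they can be checked by forming the corresponding linear systems in the monomial basis on the edge parametrized by $t \in [0, 1]$. The helper `deriv_row` and the parametrization are assumptions made for this sketch.

```python
import numpy as np

def deriv_row(t, degree, order):
    """Row of the order-th derivative of the monomials 1, t, ..., t^degree at t."""
    row = np.zeros(degree + 1)
    for n in range(order, degree + 1):
        coeff = 1.0
        for m in range(order):
            coeff *= (n - m)
        row[n] = coeff * t ** (n - order)
    return row

# Trace of p on the edge (degree 5): value, first and second tangential
# derivative at both ends give six conditions
M5 = np.array([deriv_row(t, 5, k) for t in (0.0, 1.0) for k in (0, 1, 2)])

# Normal derivative on the edge (degree 4): value and tangential derivative
# at both ends plus the value at the midpoint give five conditions
M4 = np.array([deriv_row(t, 4, k) for t in (0.0, 1.0) for k in (0, 1)]
              + [deriv_row(0.5, 4, 0)])

rank5 = np.linalg.matrix_rank(M5)
rank4 = np.linalg.matrix_rank(M4)
```

Both matrices have full rank, so the six (respectively five) interpolated values determine the quintic trace (respectively the quartic normal derivative) uniquely.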

Combining these arguments, we see that the degree 5 Argyris element gives a 
piecewise polynomial interpolant on a triangulation that has continuous first deriva- 
tives. 

Alternatives to Argyris-type elements are so-called macro-elements, such as the
Hsieh-Clough-Tocher (HCT) element illustrated in Figure 4.3.8.

Macro-elements do not give a single polynomial interpolant over the whole triangle, but rather
a piecewise polynomial interpolant where the pieces are sub-triangles of the original
triangle. The pieces of the HCT triangle are indicated by the dashed lines in
Figure 4.3.8. The advantage is the degree of the polynomial in each piece can be less
than if we were imposing the same conditions on a single polynomial.

The HCT element is a piecewise cubic macro-element. The cubic polynomial
on each piece has 10 coefficients, giving a total of 30 coefficients to specify the
piecewise cubic interpolant. The cubic polynomial on each piece has to satisfy the
values and first derivatives ($\partial p/\partial x$, $\partial p/\partial y$) at two vertices of the triangle. This gives
six conditions for each piece, giving a total of 18 conditions overall. Each external
edge has the normal derivative of the interpolant specified at the midpoint. This gives
a total of 21 conditions on the interpolating cubics. The additional nine conditions
come from the continuity conditions on the internal edges. From continuity of the
interpolant values and the first derivatives of the interpolant, we need matching values
and first derivatives at the centroid. This means we have two sets of three equations
to ensure these match. The internal edges go from the centroid to the vertices of
the triangle; along each internal edge the values and tangential derivatives of the
interpolant match at the endpoints. Since the interpolant on each side of this internal
edge is a cubic polynomial, we see that both of these interpolants match on the internal
edge. This further implies that the tangential derivatives match on an internal edge.
To complete the construction, we need to use the remaining three degrees of freedom
to match the normal derivatives on the internal edges.

The normal derivative on an internal edge is a quadratic polynomial. These match 
at the endpoints of that internal edge as the gradients match at the endpoints of an 
internal edge. Since the normal derivative is a quadratic polynomial along an internal 
edge, we just need matching normal derivatives at one point at (say) the midpoint of 
the internal edge. This gives the extra three conditions needed to uniquely specify 
the cubic polynomials on each piece. 

If we try matching the external edges of an HCT element across element boundaries,
we note that the values and tangential derivatives match at the endpoints of the
external edge. Since the interpolants on each side of an external edge are cubic
polynomials along that edge, they must match along the entire external edge. Because the first
derivatives must match at the endpoints of an external edge, the normal derivatives
must match there as well. In addition, the normal derivatives must match at the midpoint
of the external edge, by construction of the HCT element. As the normal derivative is a
quadratic polynomial along the edge, matching at these three points means that the
normal derivatives match along the entire external edge.

Thus, the HCT element can also be used to create piecewise cubic interpolants 
that are continuously differentiable over triangulations. 


4.3.3 /\ Approximation Error over Triangulations 


If we have a smooth function $f$ and a triangulation $\mathcal{T}_h$, we can interpolate $f$ over
$\Omega_h = \bigcup_{K \in \mathcal{T}_h} K \subset \mathbb{R}^d$, the union of triangles of $\mathcal{T}_h$. We can define the interpolation
operator $\mathcal{I}_h f(x) = \hat{\mathcal{I}} \hat{f}(\hat{x})$ where $x = T_K(\hat{x})$ and $x \in K$, for $K$ a triangle in $\mathcal{T}_h$.
There are many ways of defining $\hat{\mathcal{I}}$ as we have seen: Lagrange interpolation, Hermite
elements, Argyris and Hsieh-Clough-Tocher elements. Each of these interpolants
$\mathcal{I}_h f$ is a piecewise polynomial of a certain degree. Provided the image of $\hat{\mathcal{I}}$ contains
all polynomials of degree $\le k$ for some $k$, and the triangulation is “well shaped”, the
order of approximation is given by

$|f - \mathcal{I}_h f|_{W^{r,p}(\Omega_h)} = O(h^{k+1-r})\, \|f\|_{W^{m,p}(\Omega_h)}$

provided $m \ge \max(r, k + 1)$ and $m > \ell + d/p$ (or $m \ge \ell$ if $p = \infty$) where $\mathcal{I}_h$ uses
the values and derivatives of $f$ up to derivatives of order $\ell$.

The condition that $m > \ell + d/p$ is necessary to ensure that the function and
derivative values up to order $\ell$ of $f$ can be bounded in terms of $\|f\|_{W^{m,p}(\Omega_h)}$.

If we have a function $f \in W^{k+1,p}(\Omega)$ then we cannot necessarily even define the
interpolant $\mathcal{I}_h f$. However, we can approximate $f$ by an interpolant $\mathcal{I}_h g$ for some $g$.
We can construct a suitable function $g$ as a convolution of an extension of $f$ to $\mathbb{R}^d$
with a smooth kernel function.
The extension of $f$ beyond $\Omega$ can be achieved through an extension operator
$\mathcal{E}\colon W^{m,p}(\Omega) \to W^{m,p}(\mathbb{R}^d)$ provided $\Omega \subset \mathbb{R}^d$ has a boundary that is locally the graph
of a Lipschitz continuous function [31, 237]. This condition on the boundary means
that re-entrant corners are allowed, but not “cusps”. The distinction is illustrated in
Figure 4.3.9.

Fig. 4.3.9 Corner vs. cusp

The extension operator $\mathcal{E}$ is a bounded operator in the sense that there is a constant
$C$ where $\|\mathcal{E} f\|_{W^{m,p}(\mathbb{R}^d)} \le C\, \|f\|_{W^{m,p}(\Omega)}$ for all $f$. Note that the extension satisfies
$\mathcal{E} f(x) = f(x)$ for all $x \in \Omega$, so this constant $C$ must be at least one. Since $\mathcal{E} f(x) = f(x)$ for
all $x \in \Omega$, $\mathcal{E} f$ cannot be smoother than $f$. To handle this we consider the convolution
$\mathcal{E} f * \psi_s$ where $\psi_s$ is a smooth function (derivatives of all orders are continuous) that
has compact support (that is, $\psi_s(z) = 0$ for $\|z\|_2 \ge R(s)$). We use the parameter
$s > 0$ as a scaling parameter. More specifically,

$\psi_s(z) = s^{-d}\, \psi(z/s).$

The $R(s)$ above can then be given as $R(1)\, s$. We also assume that $\int_{\mathbb{R}^d} \psi(z)\, dz = 1$.
Then $\int_{\mathbb{R}^d} \psi_s(z)\, dz = 1$ for any $s > 0$. The other condition that we need for the $\psi$
function is that it has zero moments of order up to $k$:

(4.3.19)  $\int_{\mathbb{R}^d} z^\alpha\, \psi(z)\, dz = 0$ for all $\alpha$ where $0 < |\alpha| \le k$.


This zero moments property ensures that the convolutions € f * ~, converge rapidly 
to € f as s | 0. This is straightforward to show for p = 2 via Fourier transforms: 
w*t!- PR?) = W**!?(R%), The moment conditions (4.3.19) and fa. W(z) dz = 1 
imply that 

Fug) =14+ OCG") asg > 0. 


The reason is that 
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gle 
0=Flz% 22 ¥@]Q= i ea FOO) for 0 < |a| <k. 


As Fvw(¢) is analytic in ¢, we can use multivariate Taylor series, and Fy(0) = 
Sea W(Z) az = 1. Then provided € f € W**t!7(R¢) = H**1(R4) andr <k +1, 


IEF — EF * bsp, = my i, “ISIS IF EFI ® — FES # Ws OP dé 


= (2a)4 i; ; Ell’ WW — Fu (PIF (EFI? dé 


< max Wel 7 [1 — Fab (OP x 


(2n)~4 [ Nel FEAL @P ae 


= max [EI P 1 FUSE)? IE Sliveni2ceey 
Setting 7 = s&, we see that 


max Nese? [1 — Fu? = max In/sgo? [1 — Fea)? 


= 7) max gl? I Fee 


We need Fup(7) = 1+ O(||7| os to ensure that the maximum exists forO <r < k. 
Summarizing, 


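$$|\mathcal{E}f - \mathcal{E}f * \psi_s|_{W^{r,2}(\mathbb{R}^d)} \le O(s^{k+1-r})\, |\mathcal{E}f|_{W^{k+1,2}(\mathbb{R}^d)} = O(s^{k+1-r})\, \|f\|_{W^{k+1,2}(\Omega)},$$


as $s \downarrow 0$. On the other hand, if $r > k + 1$ we have

$$|\mathcal{E}f * \psi_s|_{W^{r,2}(\mathbb{R}^d)} = O(s^{k+1-r})\, \|f\|_{W^{k+1,2}(\Omega)}.$$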


This bound can be extended to other p from one to oo [237]. 

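We can apply the interpolation operator $\mathcal{I}_h$ to $\mathcal{E}f * \psi_s$ as it belongs to $W^{m,p}(\mathbb{R}^d)$
and thus $W^{m,p}(\Omega)$ where $m > \ell + d/p$. This gives an approximation to $f$ over $\Omega$,
and we can estimate the error in the approximation:

$$\begin{aligned}
|f - \mathcal{I}_h(\mathcal{E}f * \psi_s)|_{W^{r,p}(\Omega)}
&\le |f - \mathcal{E}f * \psi_s|_{W^{r,p}(\Omega)} + |\mathcal{E}f * \psi_s - \mathcal{I}_h(\mathcal{E}f * \psi_s)|_{W^{r,p}(\Omega)} \\
&\le O(s^{k+1-r})\, \|f\|_{W^{k+1,p}(\Omega)} + O(1) \sum_{j=k+1}^{m} h^{j-r}\, |\mathcal{E}f * \psi_s|_{W^{j,p}(\Omega)} \\
&= O(s^{k+1-r})\, \|f\|_{W^{k+1,p}(\Omega)} + O(1) \sum_{j=k+1}^{m} h^{j-r}\, s^{k+1-j}\, \|f\|_{W^{k+1,p}(\Omega)}.
\end{aligned}$$

If we take $s = h$, or even just within a constant factor, we get

(4.3.20)   $$|f - \mathcal{I}_h(\mathcal{E}f * \psi_s)|_{W^{r,p}(\Omega)} = O(h^{k+1-r})\, \|f\|_{W^{k+1,p}(\Omega)}.$$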


This estimate assumes that the triangulation is “well shaped” with $\max_{K \in \mathcal{T}_h} \kappa(A_K) \le
\kappa_{\max}$ independent of $h$. If $r = 0$, this “well-shaped” condition on the triangulation
$\mathcal{T}_h$ can be dropped: we simply need to ensure that the maximum diameter of the
triangles in $\mathcal{T}_h$ is small to obtain small errors. These approximation error estimates
are asymptotically optimal; we cannot expect to obtain asymptotically better bounds 
given the smoothness of the function being approximated and the degree of the 
piecewise polynomial approximation. 


4.3.4 Creating Triangulations 


Creating triangulations is an important but often neglected task in obtaining piecewise
polynomial approximations or solving partial differential equations. Rather
than leave this issue unaddressed, we give a simple yet practical algorithm by
Persson and Strang [201] that gives “well-shaped” triangulations suitable for two- and
three-dimensional problems. This method involves some novel tools adapted from
computational geometry, such as Voronoi diagrams and the associated Delaunay
triangulations [70, Chaps. 7 & 9]. The Voronoi diagram and Delaunay triangulation
for a set of $N$ points in $\mathbb{R}^2$ can be computed in $O(N \log N)$ time, while for $N$ points
in $\mathbb{R}^3$ these can be computed in $O(N^2)$ time.


The Voronoi diagram of a set of points $\{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^d$ is the collection
of regions

(4.3.21)   $$V_j = \left\{\, x \;\middle|\; \|x - x_j\|_2 = \min_k \|x - x_k\|_2 \,\right\}.$$


That is, $V_j$ is the set of points in $\mathbb{R}^d$ closer to $x_j$ than to any other point in $\{x_1, x_2, \ldots,
x_N\}$. Because we use the Euclidean or 2-norm for measuring distance, the boundaries
are piecewise linear: if $\|x - a\|_2 = \|x - b\|_2$ then squaring and subtracting gives
$2(b - a)^T x = \|b\|_2^2 - \|a\|_2^2 = (b - a)^T (b + a)$; in two dimensions this is the line perpendicular to $b - a$
that intersects the line segment joining $a$ and $b$ at its midpoint.
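The definition (4.3.21) and the bisector identity can be checked numerically. The following is only a minimal sketch using a brute-force nearest-site search; the function names `nearest_site` and `bisector_residual` and the sample points are illustrative assumptions, and this is not one of the efficient algorithms referred to in the text.

```python
import math

def nearest_site(x, sites):
    """Index j of the site closest to x, i.e. the Voronoi region V_j containing x."""
    return min(range(len(sites)), key=lambda j: math.dist(x, sites[j]))

def bisector_residual(x, a, b):
    """2 (b - a)^T x - (b - a)^T (b + a); this is zero exactly on the bisector of a and b."""
    return sum((bi - ai) * (2.0 * xi - (bi + ai)) for xi, ai, bi in zip(x, a, b))

# Hypothetical sample sites in the plane.
sites = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
a, b = sites[0], sites[1]
midpoint = (1.0, 0.0)
on_bisector = (1.0, 0.7)   # midpoint plus a vector perpendicular to b - a
```

Points with zero residual are equidistant from `a` and `b`, which is why the boundary between two Voronoi regions lies on such a bisector.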

The Delaunay triangulation of $\{x_1, x_2, \ldots, x_N\}$ is a graph where the vertices
are the points $\{x_1, x_2, \ldots, x_N\}$, and there is an edge $x_j \sim x_\ell$ if $V_j$ and $V_\ell$ have a
common edge (in two dimensions) or face (in three dimensions). In two dimensions,
the Delaunay triangulation can be thought of as the “dual graph” to the Voronoi
diagram (or perhaps, the boundaries of the $V_j$’s): the Voronoi regions $V_j$ become
the vertices of the Delaunay triangulation while the edges of the Voronoi diagram
correspond to the edges of the Delaunay triangulation, except that the edge $x_j \sim x_\ell$
of the Delaunay triangulation is perpendicular to the common edge of $V_j$ and $V_\ell$.

Fig. 4.3.10 Voronoi diagram and Delaunay triangulation for a set of points; the dashed lines show
the Delaunay triangulation


Efficient algorithms for computing Voronoi regions and Delaunay triangulations can 
be found in [70, Chap. 9]. An example of a Voronoi diagram and its associated 
Delaunay triangulation is shown in Figure 4.3.10. 

Some of the common borders between the regions of a Voronoi diagram are 
outside the figure. 

As can be seen from Figure 4.3.10, there is no guarantee that the triangles of a 
Delaunay triangulation are “well shaped”. 

The algorithm of Persson and Strang uses two functions: a signed “distance”
function $D\colon \mathbb{R}^d \to \mathbb{R}$ and a relative spacing function $H\colon \mathbb{R}^d \to \mathbb{R}$. The region is $\Omega =
\{\, x \in \mathbb{R}^d \mid D(x) < 0 \,\}$. A triangle $K$ in the generated triangulation
$\mathcal{T}$ should have diameter $h_K \le h_0\, H(x)$ for $x \in K$. The value of $h_0$ represents the
overall size of the triangles in the eventual triangulation.

The Persson and Strang algorithm begins by creating a uniformly spaced grid of
points $x_j$ over a bounding box $[a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_d, b_d]$. The generated
points $x_j$ where $D(x_j) > \epsilon$ for a specified tolerance $\epsilon > 0$ are removed. The Delaunay
triangulation of the remaining points is then constructed. Any triangle that has a
centroid outside $\Omega$ is removed.

Once a set of points roughly defining the region $\Omega$ has been created, the points are
adjusted both to better approximate the boundary of $\Omega$ and to make the mesh “well
shaped”. To make the triangulation “well shaped”, the points are moved to reduce the
discrepancy between the length of each edge and its target length. This is done by creating a “force” that
is applied to each point. This force is computed by first evaluating $H(x)$ at the midpoint $x$
of every edge of the triangulation.

To make this more formal, for any edge $e$ of the triangulation, let $\ell_e$ be the length
of edge $e$ ($\ell_e = \|x_j - x_k\|_2$ where $x_j$ and $x_k$ are the endpoints of edge $e$), and
let $x_e = (x_j + x_k)/2$ be the midpoint of edge $e$. Let $h_e = H(x_e)$. The target length for
edge $e$ is $\hat\ell_e = h_e\, F_{\text{scale}} \left[\sum_f \ell_f^2 \big/ \sum_f h_f^2\right]^{1/2}$, the sums over $f$ ranging over all edges
of the triangulation. The factor $F_{\text{scale}}$ is chosen to be about 1.2 for two-dimensional
triangulations; the important issue is that $F_{\text{scale}} > 1$. The “force” applied to edge $e$
has magnitude $F_e = \max(\hat\ell_e - \ell_e, 0)$, and in vector terms is $\mathbf{F}_e = F_e\,(x_j - x_k)/\ell_e$
so that the force acts along the direction of the edge. These forces on edges are turned
into forces on vertices: the force on vertex $x_j$ is then


$$\mathbf{F}_j = \sum_{e:\,\mathrm{start}(e)=j} \mathbf{F}_e \;-\; \sum_{e:\,\mathrm{end}(e)=j} \mathbf{F}_e.$$


The positions are updated by $x_j \leftarrow x_j + \Delta t\, \mathbf{F}_j$. This, of course, does not correspond
to the true meaning of force, which would involve updating momentum; rather, the force
is used to update positions directly. The value used for $\Delta t$ is 0.2.
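The force update described above can be sketched in a few lines. This is a minimal illustration for points in the plane, assuming the edge list is already available (no Delaunay step, no fixed points, no boundary projection); the name `force_step` and the test triangle are hypothetical and not taken from the Persson-Strang code.

```python
import math

def force_step(points, edges, H, fscale=1.2, dt=0.2):
    """One position update x_j <- x_j + dt * F_j for a list of 2D points."""
    lengths, h = [], []
    for j, k in edges:
        (xj, yj), (xk, yk) = points[j], points[k]
        lengths.append(math.hypot(xj - xk, yj - yk))
        h.append(H((0.5 * (xj + xk), 0.5 * (yj + yk))))   # H at the edge midpoint
    # Common factor F_scale * sqrt(sum of l_f^2 / sum of h_f^2) in the target length.
    scale = fscale * math.sqrt(sum(l * l for l in lengths) / sum(v * v for v in h))
    forces = [[0.0, 0.0] for _ in points]
    for (j, k), length, he in zip(edges, lengths, h):
        mag = max(he * scale - length, 0.0)   # only edges shorter than target push
        fx = mag * (points[j][0] - points[k][0]) / length
        fy = mag * (points[j][1] - points[k][1]) / length
        forces[j][0] += fx                    # start(e) = j receives +F_e
        forces[j][1] += fy
        forces[k][0] -= fx                    # end(e) = k receives -F_e
        forces[k][1] -= fy
    return [(x + dt * f[0], y + dt * f[1]) for (x, y), f in zip(points, forces)]

# Equilateral triangle with unit sides and uniform spacing H = 1.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3.0) / 2.0)]
edges = [(0, 1), (1, 2), (0, 2)]
moved = force_step(triangle, edges, H=lambda x: 1.0)
new_length = math.dist(moved[0], moved[1])
```

With all edges equal, every target length is $F_{\text{scale}}$ times the current length, so the forces are purely repulsive and the triangle expands uniformly. In the full algorithm this expansion is balanced by the boundary projection step.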

Certain points $x_j$ can be fixed, so that they are not updated using these “forces”.
It is particularly helpful to fix the corners of $\Omega$, for example.

To make the points give a better approximation to $\Omega$, points outside $\Omega$ are moved
closer to the boundary. This is done essentially by using an under-determined Newton
method $x \leftarrow x - D(x)\, \nabla D(x) / \|\nabla D(x)\|_2^2$, using a finite difference approximation
to $\nabla D(x)$.
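The projection step can be illustrated as follows. This is a sketch under stated assumptions: central differences with a hypothetical step size `h = 1e-6` stand in for whatever finite difference formula is actually used, and the helper names are illustrative.

```python
import math

def fd_gradient(D, x, h=1e-6):
    """Central-difference approximation to the gradient of D at x."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((D(xp) - D(xm)) / (2.0 * h))
    return grad

def project_step(D, x):
    """One update x <- x - D(x) grad D(x) / ||grad D(x)||_2^2."""
    g = fd_gradient(D, x)
    factor = D(x) / sum(gi * gi for gi in g)
    return [xi - factor * gi for xi, gi in zip(x, g)]

def unit_circle(x):
    """D(x) = x^T x - 1, negative inside the unit disk."""
    return x[0] ** 2 + x[1] ** 2 - 1.0

x = [1.5, 2.0]              # a point outside the unit disk
for _ in range(8):
    x = project_step(unit_circle, x)
radius = math.hypot(x[0], x[1])
```

For this $D$ the update moves the point along the ray through the origin, and the radius follows Newton's method for $r^2 = 1$, so a handful of steps is enough to reach the boundary to high accuracy.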

Once the vertices $x_j$ are updated, the Delaunay triangulation for the new set of
points is computed, and the update process given above is repeated. This happens
until the change between the previous and updated $x_j$’s falls below a threshold.

The signed distance function $D(x)$ has to be created to represent
$\Omega = \{\, x \mid D(x) < 0 \,\}$. A circle of center $a$ and radius $r$ can be represented by
$D(x) = (x - a)^T (x - a) - r^2$. A rectangle $\Omega = (a, b) \times (c, d)$ can be represented
by $D(x, y) = \max(|x - (a + b)/2| - (b - a)/2,\ |y - (c + d)/2| - (d - c)/2)$.
We can also combine functions to represent combined regions: if $D_1$ represents
$\Omega_1$ while $D_2$ represents $\Omega_2$, then

$$\begin{aligned}
D(x) &= \max(D_1(x), D_2(x)) && \text{represents } \Omega_1 \cap \Omega_2, \\
D(x) &= \min(D_1(x), D_2(x)) && \text{represents } \Omega_1 \cup \Omega_2, \\
D(x) &= \max(D_1(x), -D_2(x)) && \text{represents } \Omega_1 \backslash \Omega_2.
\end{aligned}$$
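These rules translate directly into code. The sketch below represents each region in the plane by a function of `(x, y)` that is negative inside the region; the function names and the sample regions are assumptions made for illustration.

```python
def circle(center, r):
    """D(x, y) = (x - a)^T (x - a) - r^2 for a circle of center a and radius r."""
    return lambda x, y: (x - center[0]) ** 2 + (y - center[1]) ** 2 - r ** 2

def rectangle(a, b, c, d):
    """D for the rectangle (a, b) x (c, d)."""
    return lambda x, y: max(abs(x - (a + b) / 2) - (b - a) / 2,
                            abs(y - (c + d) / 2) - (d - c) / 2)

def intersection(D1, D2):
    return lambda x, y: max(D1(x, y), D2(x, y))

def union(D1, D2):
    return lambda x, y: min(D1(x, y), D2(x, y))

def difference(D1, D2):
    return lambda x, y: max(D1(x, y), -D2(x, y))

disk = circle((0.0, 0.0), 1.0)
box = rectangle(0.0, 2.0, -0.5, 0.5)
plate_with_hole = difference(box, circle((1.0, 0.0), 0.25))
```

A point belongs to a combined region exactly when the combined function is negative there, so membership tests need only the sign of `D`.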


While these functions are not smooth, they are effective for the Persson—Strang 
algorithm. Furthermore, the places where the function D(x) is not smooth on the 
boundary are usually at a corner point of Q. Often these points are better represented 
as fixed points. Even for points on the boundary where $D(x)$ is not smooth, the
updates can still converge rapidly, although the theory supporting this argument is
beyond the scope of this text.

Fig. 4.3.11 Triangulation from Persson-Strang algorithm

To give an example of the results of the Persson—Strang triangulation algorithm, 
Figure 4.3.11 shows the results for the set 


$\Omega = \{\, (x, y) \mid y > 0 \ \&\ (x^2 + y^2 < 1 \text{ or } 0 < x < 1) \,\}$ represented by
$D(x, y) = \max(\min(x^2 + y^2 - 1,\ y(y - 1)),\ x(x + 1))$.


The Persson-Strang algorithm is far from perfect. It can go into infinite loops 
for several reasons. One is due to the fact that the triangulation is a discrete object 
computed from points, and so there are discontinuities where small changes in the 
point positions result in large discrete changes in the Delaunay triangulation. This 
introduces the possibility of oscillation between two triangulations: one triangulation 
results in a small change to the point positions, resulting in another triangulation. 
Then the new triangulation results in reversed point positions, resulting in the original 
triangulation. The combination of non-smooth but Lipschitz $D(x)$ functions with
finite difference approximations to $\nabla D(x)$ can result in bad performance
in the resulting algorithm. Also, the three-dimensional triangulations generated can have
elements that are not “well shaped”.

These problems can be fixed using a more sophisticated and more complex algorithm:
locking the triangulation while point positions “settle down”, and using exact
values for $\nabla D(x)$ or a suitable generalization of the gradient.

In spite of these issues, the Persson—Strang algorithm is a good starting point for 
investigating triangulation methods. 


Exercises. 


(1) The six points in this triangle are not sufficient to interpolate a quadratic function.
Why?

(2) An interpolation method for cubic interpolation in $\mathbb{R}^2$ requires 10 function values,
but this is not sufficient. Show that the points below in a triangle are sufficient
for cubic interpolation.

The center point is at the centroid. The dotted lines are at a distance of 1/4 of
the edge length from the nearest corner of the triangle.

(3) Give a Lagrange basis for cubic polynomials with the interpolation scheme of
Exercise 2. That is, if $x_j$, $j = 1, 2, \ldots, 10$ are the interpolation points, find
cubic polynomials $\phi_j(x)$, $j = 1, 2, \ldots, 10$, where $\phi_j(x_k) = 1$ if $j = k$ and zero
otherwise. It may be simpler to first write the $\phi_j(x)$ in terms of barycentric
coordinates $(\lambda_1, \lambda_2, \lambda_3)$ with $0 \le \lambda_i$ for all $i$ and $\sum_{i=1}^{3} \lambda_i = 1$. Symmetry can
be used to reduce the amount of work to do this.

(4) The interpolation scheme in Exercise 2 can be used to interpolate on a single
triangle, but should not be used to generate a piecewise cubic interpolation on a
triangulation. Explain why.

(5) Here is a modification of the interpolation in Exercise 2. Explain why this
interpolation scheme can be used to generate a piecewise cubic interpolation on a
triangulation.

(6) Estimate the Lebesgue constant for the interpolation method in Exercise 2
over the triangle. Compare the number obtained against the Lebesgue constant
for interpolation using the Lagrange points with barycentric coordinates
$(j/3, k/3, \ell/3)$ where $j, k, \ell \ge 0$ and $j + k + \ell = 3$.

(7) Use the Persson-Strang triangulation method to obtain a triangulation of a unit
disk $\Omega = \{\, (x, y) \mid x^2 + y^2 < 1 \,\}$ with a target triangle diameter of 0.2, 0.1,
and 0.05. Use these triangulations to obtain piecewise quadratic interpolants 
using the Lagrange points (see Figure 4.3.1) of f(x, y) = e**” sin(ax) — 
1/(1 + x* + y?). Plot the maximum error against the target triangle diameter. 

(8) The Hermite interpolation scheme on triangles (see Figure 4.3.2) requires first 
derivative information. Give a basis for this interpolation scheme on a refer- 
ence triangle $\hat{K} = \{\, (x, y) \mid 0 \le x, y \ \&\ x + y \le 1 \,\}$. For a general triangle $K$
with vertices $v_1, v_2, v_3$ give formulas for the Hermite scheme on $K$ using the
affine transformation $T_K(\hat{x}) = A_K \hat{x} + b_K$. [Note: $p(x) = \hat{p}(A_K^{-1}(x - b_K))$ so
$\nabla p(x) = A_K^{-T}\, \nabla \hat{p}(A_K^{-1}(x - b_K))$.]

(9) Give a basis for the Hsieh-Clough-Tocher or HCT element (Figure 4.3.8) on the
reference triangle $\hat{K} = \{\, (x, y) \mid 0 \le x, y \ \&\ x + y \le 1 \,\}$. Recall that the HCT
element has continuous first derivatives and is piecewise cubic. The central point 
is the centroid of the triangle. Use symmetry where possible to reduce the amount 
of work needed. 

(10) Partly symmetric elements like the element below can cause trouble for trian- 
gulations. Show that it is not possible to use this interpolation scheme to obtain 
continuous interpolation in a pentagon of triangles as shown below. 


Note that the barycentric coordinates of the interpolation points are $(1, 0, 0)$,
$(0, 1, 0)$, $(0, 0, 1)$, $(\tfrac{2}{3}, \tfrac{1}{3}, 0)$, $(0, \tfrac{2}{3}, \tfrac{1}{3})$, and $(\tfrac{1}{3}, 0, \tfrac{2}{3})$.


4.4 Interpolation—Radial Basis Functions 


Radial basis functions are functions of the form $x \mapsto \varphi(\|x - y\|_2)$ for some $y$.
As we have seen, interpolation over two or three dimensions can be done using 
triangulations and interpolating over each triangle in a triangulation. However, trying 
to produce smooth interpolants is difficult: complex and high order methods are 
needed even to obtain continuous first order derivative interpolants in two dimensions. 
For dimensions higher than three, using triangulations with simplices becomes even
more expensive: the standard decomposition of the $d$-dimensional hypercube $[0, 1]^d$ into
$d$-dimensional simplices with vertices at the vertices of the hypercube uses $d!$ simplices.
Also, the amount of data needed to perform such an interpolation becomes extremely
large: even a piecewise linear interpolant over the hypercube $[0, 1]^d$
requires $2^d$ data points.

Radial basis functions provide a way of performing high-dimensional interpola- 
tion without requiring vast amounts of data or huge computational cost. These ideas 
are developed in several books, such as Buhmann [37], Fasshauer [88], and Wendland
[258]. Suppose we have a set of $n$ data points $(x_i, y_i)$, $i = 1, 2, \ldots, n$, where
$x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. We look for an interpolant of the data of the form

(4.4.1)   $$p(x) = \sum_{j=1}^{n} c_j\, \varphi(\|x - x_j\|_2).$$

To find the values of the $c_j$’s, we need to solve the equations $p(x_i) = y_i$ for $i =
1, 2, \ldots, n$. That is, we need to solve the equations

(4.4.2)   $$y_i = \sum_{j=1}^{n} c_j\, \varphi(\|x_i - x_j\|_2) \quad\text{for } i = 1, 2, \ldots, n.$$


This is clearly a square linear system of equations, and so can be solved, provided
the matrix $A := [\varphi(\|x_i - x_j\|_2) \mid i, j = 1, 2, \ldots, n]$ is invertible.
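The system (4.4.2) can be set up and solved in a few lines. The sketch below is illustrative only: it assumes a Gaussian $\varphi(r) = \exp(-2r^2)$, a small hypothetical data set, and a plain Gaussian elimination routine (`solve`) written here to keep the example self-contained.

```python
import math

def solve(A, b):
    """Solve A c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            m = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= m * M[col][k]
    c = [0.0] * n
    for i in reversed(range(n)):
        c[i] = (M[i][n] - sum(M[i][k] * c[k] for k in range(i + 1, n))) / M[i][i]
    return c

def rbf_interpolant(points, values, phi):
    """Return p(x) = sum_j c_j phi(||x - x_j||_2) with p(x_i) = y_i."""
    A = [[phi(math.dist(xi, xj)) for xj in points] for xi in points]
    c = solve(A, values)
    return lambda x: sum(cj * phi(math.dist(x, xj)) for cj, xj in zip(c, points))

phi = lambda r: math.exp(-2.0 * r * r)      # assumed choice of radial function
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
values = [math.sin(x) + y * y for x, y in points]
p = rbf_interpolant(points, values, phi)
```

The same code works unchanged in any dimension $d$, since only the pairwise distances between points enter the matrix.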

Some functions $\varphi\colon [0, \infty) \to \mathbb{R}$ are not suitable; constant functions are clearly
a bad choice. Scaling can be an important issue here: if $\|x_i - x_j\|_2 \ll 1$ for all $i$
and $j$, then $\varphi(\|x_i - x_j\|_2) \approx \varphi(0)$. While the function $\varphi$ is not constant, it appears
to be nearly constant, and so numerical troubles can be expected in this case.

The placement of the interpolation points is also important. If $\|x_i - x_j\|_2 \ll 1$
for some $i$ and $j$, then

$$\|x_i - x_k\|_2 - \|x_i - x_j\|_2 \le \|x_j - x_k\|_2 \le \|x_i - x_k\|_2 + \|x_i - x_j\|_2,$$


and so $\varphi(\|x_i - x_k\|_2) \approx \varphi(\|x_j - x_k\|_2)$ for all $k$. Thus, columns $i$ and $j$ of $A$ will
be nearly linearly dependent, making $A$ ill-conditioned.

Bearing these things in mind, what can we say about the choice of $\varphi$ that will
make it possible to ensure that the matrix $A$ above is invertible? Can we find some
bounds on the size of $A^{-1}$? An example where the theory is fairly straightforward
is $\varphi(r) = \exp(-\alpha r^2)$ for a constant $\alpha > 0$. As discussed in Buhmann [37] this is in
some ways the starting point for all other radial basis function methods.

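One way to show that $A$ is invertible provided the interpolation points $x_j$, $j =
1, 2, \ldots, n$, are distinct is to show that the matrix $A$ is positive definite. Clearly $A$ is
symmetric as $a_{j\ell} = \varphi(\|x_j - x_\ell\|_2) = \varphi(\|x_\ell - x_j\|_2) = a_{\ell j}$. Let $\psi(z) = \varphi(\|z\|_2)$,
so $p(x) = \sum_{j=1}^{n} c_j\, \psi(x - x_j)$. Note that $p$ can be written as a convolution

$$\begin{aligned}
p(x) &= \Big[\psi * \sum_{j=1}^{n} c_j\, \delta(\cdot - x_j)\Big](x) = \int_{\mathbb{R}^d} \psi(x - z) \sum_{j=1}^{n} c_j\, \delta(z - x_j)\, dz \\
&= \sum_{j=1}^{n} c_j\, \psi(x - x_j),
\end{aligned}$$


where $\delta$ is the Dirac $\delta$-function. Using the convolution property of Fourier transforms
(A.2.8),

$$\begin{aligned}
\mathcal{F}p(\xi) &= \mathcal{F}\psi(\xi)\; \mathcal{F}\Big[\sum_{j=1}^{n} c_j\, \delta(\cdot - x_j)\Big](\xi) \\
&= \mathcal{F}\psi(\xi) \sum_{j=1}^{n} c_j\, e^{-i \xi^T x_j}.
\end{aligned}$$
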
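To show that the matrix $A$ is positive definite, we need to look at

$$\begin{aligned}
c^T A c &= \sum_{j,\ell=1}^{n} c_j\, a_{j\ell}\, c_\ell = \sum_{j,\ell=1}^{n} c_j\, \psi(x_j - x_\ell)\, c_\ell \\
&= \sum_{j=1}^{n} c_j\, p(x_j) = \Big(\sum_{j=1}^{n} c_j\, \delta(\cdot - x_j),\ p\Big) = (q, p),
\end{aligned}$$


where $q = \sum_{j=1}^{n} c_j\, \delta(\cdot - x_j)$ and $(f, g) = \int_{\mathbb{R}^d} f(z)\, g(z)\, dz$ is the usual inner product
for functions. Note that $\mathcal{F}q(\xi) = \sum_{j=1}^{n} c_j\, e^{-i \xi^T x_j}$. The inner product property
of the Fourier transform (A.2.6) is that $(f, g) = (2\pi)^{-d}\, (\mathcal{F}f, \mathcal{F}g)$, so

$$\begin{aligned}
c^T A c &= (2\pi)^{-d}\, (\mathcal{F}q, \mathcal{F}p) \\
&= (2\pi)^{-d} \int_{\mathbb{R}^d} \overline{\mathcal{F}q(\xi)}\; \mathcal{F}\psi(\xi)\, \mathcal{F}q(\xi)\, d\xi \\
&= (2\pi)^{-d} \int_{\mathbb{R}^d} \mathcal{F}\psi(\xi)\, |\mathcal{F}q(\xi)|^2\, d\xi.
\end{aligned}$$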

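Now

$$\mathcal{F}\psi(\xi) = \left(\frac{\pi}{\alpha}\right)^{d/2} \exp\!\left(-\frac{\|\xi\|_2^2}{4\alpha}\right) > 0$$


for all $\xi$. If $c \ne 0$ and the points $x_j$ are distinct, then $\mathcal{F}q$ is not identically zero.
Thus if $c \ne 0$, $c^T A c > 0$ and so $A$ is positive definite, and therefore invertible.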
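Positive definiteness can also be observed numerically: a Cholesky factorization $A = LL^T$ with positive diagonal exists exactly when $A$ is symmetric positive definite. The following sketch uses a hand-written `cholesky` routine and hypothetical sample points with $\alpha = 1$; it is a numerical check on one example, not a substitute for the Fourier argument.

```python
import math

def gaussian_matrix(points, alpha):
    """A with entries exp(-alpha * ||x_i - x_j||_2^2)."""
    return [[math.exp(-alpha * math.dist(p, q) ** 2) for q in points] for p in points]

def cholesky(A):
    """Return lower-triangular L with A = L L^T; raise ValueError on a non-positive pivot."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not numerically positive definite")
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

points = [(0.0, 0.0), (1.0, 0.0), (0.3, 0.8), (-0.5, 0.4)]   # distinct sample points
A = gaussian_matrix(points, alpha=1.0)
L = cholesky(A)
c = [1.0, -2.0, 0.5, 3.0]
quad_form = sum(c[i] * A[i][j] * c[j] for i in range(4) for j in range(4))
```

If two interpolation points coincide, two rows of $A$ are identical and the factorization fails, in line with the requirement that the points be distinct.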
We can estimate the Lebesgue numbers for this interpolation method: the corresponding
Lagrange functions are $L_k(x) = \sum_{j=1}^{n} c_j^{(k)}\, \psi(x - x_j)$ where $c^{(k)}$ solves
the linear system $A c^{(k)} = e_k$ where $e_k$ is the $k$th unit basis vector. So the Lebesgue
number $\Lambda$ can be bounded as

$$\Lambda = \max_x \sum_{k=1}^{n} |L_k(x)| \le \max_x \sum_{k=1}^{n} \sum_{j=1}^{n} |c_j^{(k)}|\, |\psi(x - x_j)| \le \sum_{k=1}^{n} \sum_{j=1}^{n} |c_j^{(k)}|$$


since $|\psi(z)| \le 1$ for all $z$. Note that $c^{(k)}$ is the $k$th column of $A^{-1}$ and so $\Lambda \le
\sum_{j,k=1}^{n} |(A^{-1})_{jk}|$.

Lebesgue numbers are not the end of the story about how well this works to 
approximate functions. To start with, if $\alpha \to \infty$ in $\varphi(r) = \exp(-\alpha r^2)$, then the
matrix $A$ goes to the $n \times n$ identity matrix. This gives $\Lambda \approx 1$ which seems ideal,
but as shown in Theorem 4.7, we also need good approximation properties for our
family of interpolants to get small interpolation error.

However, large $\alpha > 0$ does not give good interpolants or approximations as
$\varphi(\|x - x_j\|_2) = \exp(-\alpha\, \|x - x_j\|_2^2)$ drops off rapidly as $\|x - x_j\|_2$ increases for
large $\alpha$. We need to reduce $\alpha$ so that the influence of an interpolation point $x_j$ extends
significantly beyond the closest distinct interpolation point. On the other hand, taking
$\alpha \downarrow 0$ gives $\|A^{-1}\| \to \infty$. Finding the optimal value of $\alpha$ depends on the spacing of
the points $x_j$ and their placement in $\mathbb{R}^d$.
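The two limiting regimes are easy to see on a small example. The sketch below uses three hypothetical points and two arbitrarily chosen values of $\alpha$; it only illustrates the limits and says nothing about the best choice of $\alpha$.

```python
import math

def gaussian_matrix(points, alpha):
    """A with entries exp(-alpha * ||x_i - x_j||_2^2)."""
    return [[math.exp(-alpha * math.dist(p, q) ** 2) for q in points] for p in points]

points = [(0.0, 0.0), (1.0, 0.0), (0.3, 0.8)]
A_large = gaussian_matrix(points, alpha=1e3)    # nearly the identity matrix
A_small = gaussian_matrix(points, alpha=1e-6)   # nearly the all-ones (singular) matrix

off_diag_large = max(abs(A_large[i][j]) for i in range(3) for j in range(3) if i != j)
dev_from_ones = max(abs(A_small[i][j] - 1.0) for i in range(3) for j in range(3))
```

For large $\alpha$ each basis function is essentially invisible at the other points, while for small $\alpha$ the rows of $A$ become nearly identical, which is the source of the growth in $\|A^{-1}\|$.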


Exercises. 


(1) Show that for distinct points $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ the interpolation matrix $A(\alpha)$
with entries $a_{ij}(\alpha) = \exp(-\alpha\, \|x_i - x_j\|_2^2)$ goes to the identity matrix as $\alpha \to
+\infty$, but goes to $e e^T$ as $\alpha \downarrow 0$ where $e \in \mathbb{R}^n$ is the vector of ones.

(2) Expand the previous exercise by using the Taylor series of $a_{ij}(\alpha)$ for small
$\alpha > 0$: $a_{ij}(\alpha) = 1 - \alpha\, \|x_i - x_j\|_2^2 + O(\alpha^2)$. Show that $\|x_i - x_j\|_2^2 = (b e^T +
e b^T - 2 X^T X)_{ij}$ where $X = [x_1, x_2, \ldots, x_n]$, $e = [1, 1, \ldots, 1]^T$, and $b =
[x_1^T x_1, x_2^T x_2, \ldots, x_n^T x_n]^T$. Give a first-order approximation to $A(\alpha)$ for $\alpha \approx
0$. Give an asymptotic approximation for $A(\alpha)^{-1}$ in terms of $X$ and $b$.

(3) Use the radial basis function y(r) = exp(—ar?) to interpolate f(x, y) = 
e**Y sin(x) — 1/(1 + x? + y’) over the unit circle Q= {(x, y) | x?+y?<1}. 
Use grid points $x = (i, j)h$ for $i, j \in \mathbb{Z}$ in the unit circle with $h = 0.1$ for the
interpolant. Plot the maximum error of the radial basis interpolant against $\alpha$.
Also plot the norm $\|A^{-1}\|$ against $\alpha$ where $a_{ij} = \exp(-\alpha\, \|x_i - x_j\|_2^2)$. What
is the optimal $\alpha$ for the maximum error of the interpolant?

(4) Repeat the previous exercise with interpolation points sampled from a uniform
distribution over the unit circle: set $\mathbf{x}_k = (x_k, y_k)$ with both $x_k$ and $y_k$ sampled from the
uniform distribution over $[-1, +1]$, rejecting any $\mathbf{x}_k$ outside the unit circle. Use
314 interpolation points.

(5) Consider finding the least squares approximation $\sum_{k \in \mathbb{Z}} c_k\, \varphi(x - k) \approx 1$ for all $x$.
Assuming the solution is integer-translation invariant, set $c_k = c_0$ for all $k \in \mathbb{Z}$.
The problem then becomes: minimize $\int_{-1/2}^{+1/2} \big(1 - c_0 \sum_{k \in \mathbb{Z}} \varphi(x - k)\big)^2\, dx$. Show
that the value of $c_0$ that minimizes this integral is

$$c_0 = \frac{\displaystyle\int_{-1/2}^{+1/2} \sum_{k \in \mathbb{Z}} \varphi(x - k)\, dx}{\displaystyle\int_{-1/2}^{+1/2} \Big(\sum_{k \in \mathbb{Z}} \varphi(x - k)\Big)^2 dx}.$$

Compute this integral numerically for $\varphi(z) = \exp(-\alpha z^2)$ for $\alpha = 2^m$,
$m = -5, -4, \ldots, +4, +5$. For each value of $\alpha$ listed, compute numerically
$\int_{-1/2}^{+1/2} \big(1 - c_0 \sum_{k \in \mathbb{Z}} \varphi(x - k)\big)^2\, dx$. What value(s) of $\alpha$ minimize the least
squares error for approximating a constant function?

(6) Show that if

$$B(x) = \begin{cases}
1 - \tfrac{3}{2} x^2 + \tfrac{3}{4} |x|^3, & 0 \le |x| \le 1, \\
(2 - |x|)^3 / 4, & 1 \le |x| \le 2, \\
0, & 2 \le |x|,
\end{cases}$$

then $\mathcal{F}B(\xi) = (3/2)\, \sin^4(\xi/2) / (\xi/2)^4$ and so $\mathcal{F}B(\xi) \ge 0$ for all $\xi \in \mathbb{R}$. Show
that $\varphi(x) = B(\alpha x)$ for $\alpha > 0$ can be used as a radial basis function for interpolation
over $\mathbb{R}$. In higher dimensions, show that $p(x) := \sum_{k=1}^{n} c_k\, \psi(x - x_k)$
can be used to interpolate arbitrary values $y_k = p(x_k)$ provided all the interpolation
points $x_k \in \mathbb{R}^d$ are distinct if $\psi(z) = \prod_{j=1}^{d} B(\alpha z_j)$. [Hint: Show that
$\mathcal{F}\psi(\xi) \ge 0$ for all $\xi \in \mathbb{R}^d$.] Note that $B$ is a cubic B-spline function.


4.5 Approximating Functions by Polynomials 


How well can we approximate functions by polynomials? It depends, to some extent,
on how we measure the size of the difference between two functions $f, p\colon D \to \mathbb{R}$.
The most severe measure commonly used is the maximum error:

$$\|f - p\|_\infty = \max_{x \in D} |f(x) - p(x)|.$$


Other measures include the least squares measure

$$\|f - p\|_2 = \left[\int_D (f(x) - p(x))^2\, dx\right]^{1/2}.$$


How well can we approximate $f$ by a polynomial of degree $\le m$? Can we make the
error arbitrarily small (by taking $m \to \infty$)?


4.5.1 Weierstrass’ Theorem


How well can continuous functions be approximated by polynomials? 
Weierstrass discovered the answer: as well as we please. 


Theorem 4.12 (Weierstrass approximation theorem) For any continuous function
$f\colon [a, b] \to \mathbb{R}$ and $\epsilon > 0$ there is a polynomial $p$ where

$$\max_{a \le x \le b} |f(x) - p(x)| \le \epsilon.$$


There are many different proofs of this result. One is given in Ralston and Rabinowitz
[211] using Bernstein polynomials on $[0, 1]$. Note that if we prove the result
on $[0, 1]$, then we can use the fact that if $\hat{p}(t)$ is a polynomial approximation to
$t \mapsto f(a + (b - a)t)$ on $[0, 1]$, then $p(x) = \hat{p}((x - a)/(b - a))$ approximates $f$
by the same amount on $[a, b]$:

$$\begin{aligned}
\max_{a \le x \le b} |f(x) - p(x)| &= \max_{0 \le t \le 1} \left| f(a + t(b - a)) - \hat{p}\!\left(\frac{(a + t(b - a)) - a}{b - a}\right) \right| \\
&= \max_{0 \le t \le 1} |f(a + t(b - a)) - \hat{p}(t)|.
\end{aligned}$$


The proof in [211] uses Bernstein polynomials given by

$$\phi_{k,n}(t) = \binom{n}{k}\, t^k (1 - t)^{n-k}$$


and the quasi-interpolant

$$p_n(t) = \sum_{k=0}^{n} f(k/n)\, \phi_{k,n}(t).$$


The accuracy of the approximation depends on how smooth or rough the function $f$ is.
One measure of roughness is the modulus of continuity of a function $f\colon [a, b] \to \mathbb{R}$
given by

(4.5.1)   $$\omega(\delta) = \sup \{\, |f(x) - f(y)| : x, y \in [a, b] \text{ and } |x - y| \le \delta \,\}.$$


Note that the proof in [211] gives a bound $\|f - p_n\|_\infty = O(\omega(n^{-1/2}))$ as
$n \to \infty$. This is rather pessimistic. If $f$ is twice differentiable, then the error
in the Bernstein quasi-interpolant is, in fact, $\|f - p_n\|_\infty = O(n^{-1})$, compared to the
estimate obtained above, which is $\|f - p_n\|_\infty = O(n^{-1/2})$ as $n \to \infty$.
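The $O(n^{-1})$ behaviour for smooth functions can be seen in a small experiment. The sketch below evaluates the Bernstein quasi-interpolant directly from its definition for the test function $f(t) = t^2$ (an assumed example), for which the identity $p_n(t) = t^2 + t(1 - t)/n$ gives a maximum error of exactly $1/(4n)$.

```python
import math

def bernstein_quasi_interpolant(f, n):
    """Return p_n(t) = sum_k f(k/n) C(n, k) t^k (1 - t)^(n - k) on [0, 1]."""
    samples = [f(k / n) for k in range(n + 1)]
    def p(t):
        return sum(fk * math.comb(n, k) * t ** k * (1.0 - t) ** (n - k)
                   for k, fk in enumerate(samples))
    return p

def max_error(f, p, m=200):
    """Maximum of |f - p| over the uniform grid i/m, i = 0, ..., m."""
    return max(abs(f(i / m) - p(i / m)) for i in range(m + 1))

f = lambda t: t * t
errors = {n: max_error(f, bernstein_quasi_interpolant(f, n)) for n in (10, 20, 40)}
```

Doubling $n$ halves the error, which is slow compared with the rates available from other polynomial approximations.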

If we allow ourselves to use other polynomials than Bernstein quasi-interpolants, 
we can obtain much better approximation estimates—Jackson’s theorem (Theo- 
rem 4.14) shows how to do this very well. But Weierstrass’ result is a valuable 
start in one dimension.

In more than one dimension, there is a very powerful generalization that applies
not just to polynomial approximation, but also trigonometric polynomial and other 
kinds of approximation methods. That theorem is the Stone—Weierstrass theorem, 
and it is taught more often in topology than numerical analysis. Note that an algebra 
A is understood to be a vector space over R that has a multiplication operation that 
is consistent with the vector space structure. 


Theorem 4.13 (Stone–Weierstrass theorem) Suppose that $D$ is a compact set and $\mathcal{A}$ is an
algebra of continuous functions $D \to \mathbb{R}$ that contains a non-zero constant
function and separates points (that is, for any $x \ne y \in D$ there is a function $s$ in $\mathcal{A}$
where $s(x) \ne s(y)$). Then for any continuous function $f\colon D \to \mathbb{R}$ and $\epsilon > 0$, there
is a $p \in \mathcal{A}$ where $\|f - p\|_\infty \le \epsilon$.


Since $\mathcal{A}$ is a subalgebra of all continuous functions $D \to \mathbb{R}$, this amounts to
requiring that if $f, g \in \mathcal{A}$, then $f \cdot g \in \mathcal{A}$. Examples of such algebras for $D \subset \mathbb{R}^n$
include polynomials, trigonometric polynomials ($x \mapsto \cos(kx)$ or $x \mapsto \sin(kx)$ for
$k$ an integer), and linear combinations of exponentials $x \mapsto e^{\alpha x}$ with $\alpha \in \mathbb{R}$. It can
include algebras that are not natural, such as polynomials on $[0, 1]$ with only even
powers.

Unlike our proof of Weierstrass’ theorem, most proofs of the Stone—Weierstrass 
theorem are heavily topological and give essentially no information about how rapidly 
the approximations approach a given function in terms of the “order” or “degree” 
of the approximating function. Nevertheless, the Stone—Weierstrass theorem is a 
powerful theorem that can justify different approximation schemes. 

For a proof of the Stone—Weierstrass theorem, see [242] or [184]. 


4.5.2 Jackson’s Theorem


In 1911 Dunham Jackson, an American mathematician writing a PhD thesis at
Göttingen, Germany, proved theorems connecting the smoothness of a function and
the accuracy of trigonometric polynomial approximations and ordinary polynomial
approximations. These results and more were published in his book [133]. Sergei
Bernstein proved a converse to Jackson’s main theorem in 1912, showing that the 
asymptotics of the error of the best polynomial approximation as the degree goes to 
infinity imply a certain degree of smoothness of the function. 


Theorem 4.14 (Jackson’s theorem) If $f\colon [a, b] \to \mathbb{R}$ is $r$ times differentiable, then
there is a constant $C$ where for each positive integer $n$ there is a trigonometric
polynomial $p_n$ of degree $n$ such that

(4.5.2)   $$\max_{a \le x \le b} |f(x) - p_n(x)| \le \frac{C\, \omega(f^{(r)}, 1/n)}{n^r},$$


where $\omega$ is the modulus of continuity.

Jackson proved this result via Fourier series: start with

$$f\!\left(\frac{a + b}{2} + \frac{b - a}{2} \cos\theta\right) = \sum_{k=0}^{\infty} a_k \cos(k\theta).$$


Note that $\cos(k\theta)$ is a polynomial in $\cos\theta$; in fact, $\cos(k\theta) = T_k(\cos\theta)$ where $T_k$ is
the Chebyshev polynomial of degree $k$ (see (4.6.3)). We can approximate the infinite
sum by a finite sum. But simply truncating the sum does not necessarily give good
approximations for smooth $f$. Instead, we first pick a cutoff function $\varphi\colon [0, \infty) \to \mathbb{R}$
where $\varphi(u) = 1$ for $u \le \tfrac{1}{2}$, $\varphi(u) = 0$ for $u \ge 1$, and is smooth and non-negative.
Then we can set

$$\begin{aligned}
p_n\!\left(\frac{a + b}{2} + \frac{b - a}{2} \cos\theta\right) &= \sum_{k=0}^{n} \varphi(k/n)\, a_k \cos(k\theta) \\
&= \int_0^{2\pi} \psi_n(\theta - \theta')\, f\!\left(\frac{a + b}{2} + \frac{b - a}{2} \cos\theta'\right) d\theta' \quad\text{where} \\
\psi_n(\theta) &= \frac{1}{2\pi}\, \varphi(0) + \frac{1}{\pi} \sum_{k=1}^{n} \varphi(k/n) \cos(k\theta).
\end{aligned}$$


Because $\int_0^{2\pi} \psi_n(\theta)\, d\theta = 1$ and yet $\psi_n(\theta) \to 0$ rapidly as $n \to \infty$ for $\theta$ not a multiple
of $2\pi$, the polynomials $p_n$ converge to $f$ at a rapid rate that is determined by the
smoothness of $f$.
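The identity $\cos(k\theta) = T_k(\cos\theta)$ used in this argument can be checked numerically. The sketch below assumes the standard three-term recurrence $T_{j+1}(x) = 2x\,T_j(x) - T_{j-1}(x)$ with $T_0 = 1$ and $T_1 = x$ as the way of evaluating $T_k$; the function name `chebyshev_T` is illustrative.

```python
import math

def chebyshev_T(k, x):
    """Evaluate T_k(x) by the recurrence T_{j+1} = 2 x T_j - T_{j-1}, T_0 = 1, T_1 = x."""
    t_prev, t_curr = 1.0, x
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

# Largest discrepancy between T_k(cos(theta)) and cos(k theta) on a sample grid.
worst = max(abs(chebyshev_T(k, math.cos(theta)) - math.cos(k * theta))
            for k in range(8)
            for theta in (0.1 * i for i in range(64)))
```

This is why a cosine series in $\theta$ becomes an ordinary polynomial in $x$ after the substitution $x = \tfrac{a+b}{2} + \tfrac{b-a}{2}\cos\theta$.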


4.5.3 Approximating Functions on Rectangles and Cubes 


Creating approximations to functions on rectangles $[a_1, b_1] \times [a_2, b_2] \times
\cdots \times [a_d, b_d] \subset \mathbb{R}^d$ can be done using polynomials based on one-dimensional
methods. To simplify the discussion, we will scale and shift each interval $[a_j, b_j]$
to the interval $[-1, +1]$ so that we focus on the hypercube $[-1, +1]^d$. We can use
tensor product interpolation to create polynomial interpolants: let $-1 \le x_0 < x_1 <
\cdots < x_n \le +1$ be given one-dimensional interpolation points. Then for a function
$f\colon [-1, +1]^d \to \mathbb{R}$ we interpolate the values $y_{i_1, i_2, \ldots, i_d} = f(x_{i_1}, x_{i_2}, \ldots, x_{i_d})$. For
notational convenience we let $i = (i_1, i_2, \ldots, i_d)$ and $x_i = [x_{i_1}, x_{i_2}, \ldots, x_{i_d}]^T$. We
can use Lagrange interpolation polynomials to define the interpolant:

$$\begin{aligned}
p(x_1, x_2, \ldots, x_d) &= \sum_{i_1, i_2, \ldots, i_d = 0}^{n} y_{i_1, i_2, \ldots, i_d} \prod_{j=1}^{d} L_{i_j}(x_j), \quad\text{where} \\
L_i(x) &= \prod_{\ell = 0;\ \ell \ne i}^{n} \frac{x - x_\ell}{x_i - x_\ell}.
\end{aligned}$$

Note that the degree of $p(x_1, \ldots, x_d)$ in $x_j$ is $\le n$, so that the total degree is $\le dn$.
The Jackson theorems can be extended to this case using


with an error $O(n^{-r-\alpha})$ provided $\partial^r f / \partial x_j^r$ is Hölder continuous of exponent $\alpha \in
(0, 1]$ for $j = 1, 2, \ldots, d$. Again, $p_n(x_1, \ldots, x_d)$ has degree $\le n$ in each $x_j$ giving a
total degree of $\le dn$. Note that the hidden constant in “O” can depend exponentially
on the dimension $d$.
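Tensor product Lagrange interpolation is short to write down in two dimensions. The following sketch assumes three equally spaced nodes and a hypothetical test function of degree at most two in each variable, which the interpolant should then reproduce exactly.

```python
def lagrange_basis(nodes, i, x):
    """L_i(x) = product over l != i of (x - x_l) / (x_i - x_l)."""
    value = 1.0
    for l, xl in enumerate(nodes):
        if l != i:
            value *= (x - xl) / (nodes[i] - xl)
    return value

def tensor_interpolant_2d(f, nodes):
    """Interpolate f on the grid nodes x nodes in [-1, 1]^2."""
    y = [[f(xi, xj) for xj in nodes] for xi in nodes]
    def p(x1, x2):
        return sum(y[i][j] * lagrange_basis(nodes, i, x1) * lagrange_basis(nodes, j, x2)
                   for i in range(len(nodes)) for j in range(len(nodes)))
    return p

nodes = [-1.0, 0.0, 1.0]                      # n = 2: degree <= 2 in each variable
f = lambda x1, x2: x1 * x1 * x2 + 3.0 * x2 * x2 - 1.0
p = tensor_interpolant_2d(f, nodes)
```

The number of terms in the sum is $(n + 1)^d$, which is the practical reason tensor product methods become expensive as the dimension grows.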


4.6 Seeking the Best—Minimax Approximation 


Seeking the best approximation for a function $f\colon D \to \mathbb{R}$ from a family of functions
can be a difficult undertaking, but there are ways of doing this. This can be written
as an optimization problem: given $f$, find $p \in \mathcal{F}$, where $\mathcal{F}$ is a given family of
functions, that minimizes $\|f - p\|$ according to a specific norm $\|\cdot\|$. For minimax
approximation, we take our norm to be the maximum norm:

$$\|f - p\|_\infty = \max_{x \in D} |f(x) - p(x)|.$$


We start by looking for ways to identify when we have found a minimizing $p \in \mathcal{F}$.
For the case where $\mathcal{F}$ is the set of polynomials of degree $\le n$ in one variable and
$D = [a, b]$, the answer is found in the Chebyshev equi-oscillation theorem.


4.6.1 Chebyshev’s Equi-oscillation Theorem 


The derivation of Chebyshev’s equi-oscillation theorem starts with the recognition
that any norm function, and specifically the max-norm function $\|\cdot\|_\infty$, is a convex
function. The second starting point is that the family $\mathcal{F}$ of all polynomials of degree
$\le n$ is a vector space.

We will have two equi-oscillation theorems: a general one, applicable to general
domains and families of approximating functions, and another specialized to the case
of polynomial approximations of degree $\le n$ on an interval $[a, b]$.

First we note that any norm is convex: for $0 \le \theta \le 1$,

$$\|\theta x + (1 - \theta) y\| \le \|\theta x\| + \|(1 - \theta) y\| = \theta\, \|x\| + (1 - \theta)\, \|y\|.$$


The max-norm over a compact domain $D \subset \mathbb{R}^d$ is given by

$$\|g\|_{D,\infty} = \max_{x \in D} |g(x)| = \max_{x \in D,\ \sigma = \pm 1} \sigma\, g(x).$$


We drop the subscript “D” from $\|\cdot\|_{D,\infty}$ when it is clear from the context what $D$
should be.

This section uses material from Section 8.2. 

Since our task amounts to minimizing a convex function, we use Lemma 8.11
to show that directional derivatives $\varphi'(x; d)$ exist for convex functions, and Theorem
8.12 to characterize a minimizer in terms of directional derivatives: $\varphi'(x; d) \ge 0$
for all $d$.

Our objective function

$$p \mapsto \|f - p\|_\infty = \max_{x \in D,\ \sigma = \pm 1} \sigma\, (f(x) - p(x))$$


can be written in the form $\max_{y \in Y} \psi(p, y)$; for our problem, $Y =
\{\, (x, \sigma) \mid x \in D,\ \sigma \in \{\pm 1\} \,\}$. The set $Y$ is a closed and bounded set in $\mathbb{R}^{d+1}$, so the
maximum over $Y$ exists. The function $\psi(p, (x, \sigma)) = \sigma\, (f(x) - p(x))$ is a smooth
function of $p$, or equivalently, of the coefficients of $p$. Here is a generalization of
Danskin’s theorem that helps us compute these directional derivatives:


Theorem 4.15 Let $\psi(z, y)$ be a function where $\psi(z, y)$ and $\nabla_z \psi(z, y)$ are both
continuous in $(z, y)$. If $Y$ is a closed and bounded subset of $\mathbb{R}^m$, for $z \in \mathbb{R}^n$ we define

$$\varphi(z) = \max_{y \in Y} \psi(z, y).$$

Then for any $d \in \mathbb{R}^n$, the directional derivative $\varphi'(z; d)$ exists and is given by

$$\varphi'(z; d) = \max_{y \in Y^*(z)} \nabla_z \psi(z, y)^T d,$$


where $Y^*(z) = \{\, y \in Y \mid \psi(z, y) = \varphi(z) \,\}$, the set of all $y$’s attaining the maximum
for the given $z$.


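Proof Consider the quotient $(\varphi(\mathbf{z}+h\mathbf{d}) - \varphi(\mathbf{z}))/h$ for $h > 0$. For each $h > 0$, $\varphi(\mathbf{z}+h\mathbf{d}) = \psi(\mathbf{z}+h\mathbf{d},\mathbf{y}(h))$ for some $\mathbf{y}(h)\in Y$. Since $Y$ is compact there is a convergent subsequence $\mathbf{y}(h_k)\to\mathbf{y}^*\in Y$ with $h_k\downarrow 0$ as $k\to\infty$. Choose $\widehat{\mathbf{y}}\in Y$ where $\varphi(\mathbf{z}) = \psi(\mathbf{z},\widehat{\mathbf{y}}) \ge \psi(\mathbf{z},\mathbf{y})$ for all $\mathbf{y}\in Y$. Now

$$\begin{aligned}
\frac{\varphi(\mathbf{z}+h_k\mathbf{d}) - \varphi(\mathbf{z})}{h_k}
&= \frac{\psi(\mathbf{z}+h_k\mathbf{d},\mathbf{y}(h_k)) - \psi(\mathbf{z},\widehat{\mathbf{y}})}{h_k} \\
&= \frac{\psi(\mathbf{z}+h_k\mathbf{d},\mathbf{y}(h_k)) - \psi(\mathbf{z},\mathbf{y}(h_k))}{h_k} + \frac{\psi(\mathbf{z},\mathbf{y}(h_k)) - \psi(\mathbf{z},\widehat{\mathbf{y}})}{h_k} \\
&\le \nabla_{\mathbf{z}}\psi(\mathbf{z}+c_k h_k\mathbf{d},\mathbf{y}(h_k))^T\mathbf{d} \qquad\text{for some } 0 < c_k < 1.
\end{aligned}$$

The last step uses the mean value theorem for the first quotient, while the second quotient is non-positive by the choice of $\widehat{\mathbf{y}}$.

Note that by continuity of $\varphi$ and $\psi$, $\varphi(\mathbf{z}+h_k\mathbf{d}) = \psi(\mathbf{z}+h_k\mathbf{d},\mathbf{y}(h_k)) \to \varphi(\mathbf{z}) = \psi(\mathbf{z},\mathbf{y}^*)$, so $\mathbf{y}^*\in Y^*(\mathbf{z})$. By continuity of $\nabla_{\mathbf{z}}\psi$,

$$\limsup_{k\to\infty}\frac{\varphi(\mathbf{z}+h_k\mathbf{d}) - \varphi(\mathbf{z})}{h_k} \le \nabla_{\mathbf{z}}\psi(\mathbf{z},\mathbf{y}^*)^T\mathbf{d} \le \max_{\mathbf{y}\in Y^*(\mathbf{z})}\nabla_{\mathbf{z}}\psi(\mathbf{z},\mathbf{y})^T\mathbf{d}.$$

On the other hand, for any $\mathbf{y}\in Y^*(\mathbf{z})$, for $h > 0$,

$$\frac{\varphi(\mathbf{z}+h\mathbf{d}) - \varphi(\mathbf{z})}{h} \ge \frac{\psi(\mathbf{z}+h\mathbf{d},\mathbf{y}) - \psi(\mathbf{z},\mathbf{y})}{h} \to \nabla_{\mathbf{z}}\psi(\mathbf{z},\mathbf{y})^T\mathbf{d} \qquad\text{as } h\downarrow 0.$$

So

$$\liminf_{h\downarrow 0}\frac{\varphi(\mathbf{z}+h\mathbf{d}) - \varphi(\mathbf{z})}{h} \ge \max_{\mathbf{y}\in Y^*(\mathbf{z})}\nabla_{\mathbf{z}}\psi(\mathbf{z},\mathbf{y})^T\mathbf{d}.$$

Thus, since $\limsup_{k\to\infty}(\varphi(\mathbf{z}+h_k\mathbf{d}) - \varphi(\mathbf{z}))/h_k \le \nabla_{\mathbf{z}}\psi(\mathbf{z},\mathbf{y}^*)^T\mathbf{d}$,

$$\lim_{k\to\infty}\frac{\varphi(\mathbf{z}+h_k\mathbf{d}) - \varphi(\mathbf{z})}{h_k} = \max_{\mathbf{y}\in Y^*(\mathbf{z})}\nabla_{\mathbf{z}}\psi(\mathbf{z},\mathbf{y})^T\mathbf{d}.$$

Since there is no subsequence for which the limit can be different, the directional derivative exists and is given by

$$\varphi'(\mathbf{z};\mathbf{d}) = \max_{\mathbf{y}\in Y^*(\mathbf{z})}\nabla_{\mathbf{z}}\psi(\mathbf{z},\mathbf{y})^T\mathbf{d},$$

as we wanted.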
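
The formula of Theorem 4.15 can be checked numerically on a small case. The sketch below is only an illustration under assumptions that are not in the text: a hypothetical finite set $Y$ of four unit vectors and $\psi(\mathbf{z},\mathbf{y}) = \mathbf{y}^T\mathbf{z}$, so that $\varphi(\mathbf{z}) = \max_i |z_i|$. The helper names are made up for this example.

```python
# Hypothetical finite set Y; with psi(z, y) = y . z this gives phi(z) = max(|z_1|, |z_2|).
Y = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]

def psi(z, y):
    return y[0] * z[0] + y[1] * z[1]

def grad_z_psi(z, y):
    # psi is linear in z, so its gradient with respect to z is y itself
    return y

def phi(z):
    return max(psi(z, y) for y in Y)

def directional_derivative(z, d, tol=1e-12):
    """Maximise grad_z psi(z, y)^T d over the active set Y*(z)."""
    top = phi(z)
    active = [y for y in Y if abs(psi(z, y) - top) <= tol]
    return max(sum(g * di for g, di in zip(grad_z_psi(z, y), d)) for y in active)

z = (1.0, 1.0)      # two maximisers are active here: (1, 0) and (0, 1)
d = (0.3, -0.7)
h = 1e-6
finite_difference = (phi((z[0] + h * d[0], z[1] + h * d[1])) - phi(z)) / h
formula = directional_derivative(z, d)
```

At this $\mathbf{z}$ the function $\varphi$ is not differentiable, yet the one-sided quotient and the formula agree. The directional derivative is not linear in $\mathbf{d}$, which is why the maximum over $Y^*(\mathbf{z})$ is needed.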


Our main theorem for the one-dimensional case is below. 


Theorem 4.16 (Chebyshev's equi-oscillation theorem) Suppose $f : [a,b]\to\mathbb{R}$ is continuous and $p$ is a polynomial of degree $\le n$. Then $p$ minimizes $\|f - p\|_\infty := \max_{x\in[a,b]}|f(x) - p(x)|$ if and only if there are $n+2$ points $a \le t_0 < t_1 < t_2 < \cdots < t_{n+1} \le b$ where

(4.6.1) $f(t_i) - p(t_i) = \sigma\,(-1)^i\,\|f - p\|_\infty, \qquad i = 0, 1, 2, \ldots, n+1,$

where $\sigma = \pm 1$.
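Proof We can represent a polynomial $p(x)$ of degree $\le n$ by means of coefficients (for example): $p(x;\mathbf{c}) = \sum_{k=0}^{n} c_k x^k$. The max error is then

$$\|f - p(\cdot;\mathbf{c})\|_\infty = \max_{x\in[a,b]}|f(x) - p(x;\mathbf{c})| = \max_{s\in[a,b],\ \sigma=\pm 1}\sigma\,(f(s) - p(s;\mathbf{c})).$$

So let us set $Y = \{(s,\sigma) \mid s\in[a,b],\ \sigma = \pm 1\}$ and $\psi(\mathbf{c},\mathbf{s}) = \sigma\,(f(s) - p(s;\mathbf{c}))$, where $\mathbf{s} = (s,\sigma)$. Then $\varphi(\mathbf{c}) = \|f - p(\cdot;\mathbf{c})\|_\infty$. In what follows, we will assume that $\|f - p(\cdot;\mathbf{c})\|_\infty > 0$; otherwise, $f = p(\cdot;\mathbf{c})$ and $p(\cdot;\mathbf{c})$ is clearly the minimax approximation and the equi-oscillation condition (4.6.1) is satisfied.

So what we need to do now is to compute $\varphi'(\mathbf{c};\mathbf{d})$ for any given $\mathbf{d}$ and show that $\varphi'(\mathbf{c};\mathbf{d}) \ge 0$ for all $\mathbf{d}$ if and only if the equi-oscillation condition holds.

First: note that $Y^*(\mathbf{c})$ is the set of $(s,\sigma)$ where $\sigma\,(f(s) - p(s;\mathbf{c})) = \|f - p(\cdot;\mathbf{c})\|_\infty$. Any such $s\in[a,b]$ for which $|f(s) - p(s;\mathbf{c})| = \|f - p(\cdot;\mathbf{c})\|_\infty$ we will call an equi-oscillation point. Now

$$\begin{aligned}
\varphi'(\mathbf{c};\mathbf{d}) &= \max_{(s,\sigma)\in Y^*(\mathbf{c})}\nabla_{\mathbf{c}}\psi(\mathbf{c},s,\sigma)^T\mathbf{d} \\
&= \max_{(s,\sigma)\in Y^*(\mathbf{c})}\sigma\sum_{j=0}^{n}\frac{\partial}{\partial c_j}\Bigl(f(s) - \sum_{i=0}^{n} c_i s^i\Bigr)\,d_j \\
&= \max_{(s,\sigma)\in Y^*(\mathbf{c})}\sigma\sum_{j=0}^{n}(-s^j)\,d_j = \max_{(s,\sigma)\in Y^*(\mathbf{c})} -\sigma\,d(s),
\end{aligned}$$

where $d(s) = \sum_{j=0}^{n} d_j s^j$. For an equi-oscillation point $s$, if $(s,\sigma)\in Y^*(\mathbf{c})$, then $\sigma = \operatorname{sign}(f(s) - p(s;\mathbf{c}))$. Now $d(s)$ can be an arbitrary polynomial of degree $\le n$. If there are $n+2$ points $t_i$ satisfying the equi-oscillation condition (4.6.1), then in order to have $\varphi'(\mathbf{c};\mathbf{d}) < 0$ we need

$$-\operatorname{sign}(f(t_i) - p(t_i;\mathbf{c}))\,d(t_i) < 0$$

as $t_i$ is an equi-oscillation point for each $i$. Thus $\operatorname{sign} d(t_i) = -\operatorname{sign} d(t_{i+1})$ for $i = 0, 1, \ldots, n$, and $d(s)$ would have $n+1$ roots in the interval $[a,b]$. This implies that $d(s) \equiv 0$ and so we cannot have $\varphi'(\mathbf{c};\mathbf{d}) < 0$.

On the other hand, suppose that $\varphi'(\mathbf{c};\mathbf{d}) \ge 0$ for all $\mathbf{d}$. We then need to find the points $a \le t_0 < t_1 < t_2 < \cdots < t_{n+1} \le b$ satisfying (4.6.1). We need to deal with two sets of points in $[a,b]$:

$$\begin{aligned}
S_+ &= \{\,s\in[a,b] \mid f(s) - p(s;\mathbf{c}) = +\|f - p(\cdot;\mathbf{c})\|_\infty\,\} \quad\text{and} \\
S_- &= \{\,s\in[a,b] \mid f(s) - p(s;\mathbf{c}) = -\|f - p(\cdot;\mathbf{c})\|_\infty\,\}.
\end{aligned}$$

Clearly the union $S_+\cup S_-$ is the set of equi-oscillation points, which is non-empty. Also, both sets $S_\pm$ are closed sets.

Let $u_0 = \min\{\,s\in[a,b] \mid s\in S_+\cup S_-\,\} = \min(S_+\cup S_-)$. Then $u_0\in S_+$ or $u_0\in S_-$ (exclusive). We let $\widehat{\sigma}\in\{\pm 1\}$ be chosen so that $u_0\in S_{\widehat{\sigma}}$, $S_{\widehat{\sigma}}$ being $S_+$ or $S_-$ according to whether $\widehat{\sigma} = +1$ or $-1$, respectively. Then let $u_1 = \min S_{-\widehat{\sigma}}$, the first equi-oscillation point where the error has the opposite sign to the error at $u_0$. Now define $v_0 = \max\{\,s\in S_{\widehat{\sigma}} \mid s \le u_1\,\}$. We can continue defining $u_k = \min\{\,s\in S_{(-1)^k\widehat{\sigma}} \mid s \ge u_{k-1}\,\}$ and then $v_{k-1} = \max\{\,s\in S_{(-1)^{k-1}\widehat{\sigma}} \mid s \le u_k\,\}$ until one of the sets becomes empty: say $\{\,s\in S_{(-1)^{m+1}\widehat{\sigma}} \mid s \ge u_m\,\} = \emptyset$. Taking $v_m = \max(S_+\cup S_-)$, we then have $a \le u_0 \le v_0 < u_1 \le v_1 < u_2 \le \cdots < u_m \le b$, and $S_{\widehat{\sigma}} \subset \bigcup_{j:\,2j\le m}[u_{2j}, v_{2j}]$ and $S_{-\widehat{\sigma}} \subset \bigcup_{j:\,2j+1\le m}[u_{2j+1}, v_{2j+1}]$. Also, $\operatorname{sign}(f(u_k) - p(u_k;\mathbf{c})) = (-1)^k\widehat{\sigma}$. We can choose $m$ points $w_k$ where $v_k < w_k < u_{k+1}$, $k = 0, 1, \ldots, m-1$, and set $d(s) = \widehat{\sigma}\prod_{k=0}^{m-1}(w_k - s)$, which is a polynomial of degree $m$. Let $d(s) = \sum_{j=0}^{n} d_j s^j$, provided $m \le n$. Then for $s\in[u_k, v_k]$, $\operatorname{sign} d(s) = \widehat{\sigma}\,(-1)^k$; for $s\in S_{\pm\widehat{\sigma}}$, $\operatorname{sign}(f(s) - p(s;\mathbf{c})) = \pm\widehat{\sigma}$. Thus

$$\begin{aligned}
\varphi'(\mathbf{c};\mathbf{d}) &= \max_{(s,\sigma)\in Y^*(\mathbf{c})} -\sigma\,d(s) \\
&= \max\Bigl(\max_{s\in S_{\widehat{\sigma}}} -\widehat{\sigma}\,d(s),\ \max_{s\in S_{-\widehat{\sigma}}} +\widehat{\sigma}\,d(s)\Bigr).
\end{aligned}$$

But for $s\in S_{\widehat{\sigma}}$, $\operatorname{sign}(-\widehat{\sigma}\,d(s)) = -\widehat{\sigma}^2(-1)^{2j} = -1$ for some integer $j$, while for $s\in S_{-\widehat{\sigma}}$, $\operatorname{sign}(+\widehat{\sigma}\,d(s)) = +\widehat{\sigma}^2(-1)^{2j+1} = -1$. Thus $\varphi'(\mathbf{c};\mathbf{d}) < 0$ for this $\mathbf{d}$ if $m \le n$. Thus we conclude that if $\varphi'(\mathbf{c};\mathbf{d}) \ge 0$ for all $\mathbf{d}$, then we must have $m \ge n+1$.

Then we can take $t_k = u_k$, for $k = 0, 1, 2, \ldots, n+1$ as the points satisfying the equi-oscillation condition (4.6.1), as we wanted.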


Related to the Chebyshev equi-oscillation theorem is the theorem of de la Vallée- 
Poussin which gives a lower bound on the minimax approximation error. 


Theorem 4.17 (Theorem of de la Vallée-Poussin) Suppose $f : [a,b]\to\mathbb{R}$ is continuous and $p$ is a polynomial of degree $\le n$. Suppose there are $n+2$ points $a \le t_0 < t_1 < t_2 < \cdots < t_{n+1} \le b$ where

(4.6.2) $f(t_i) - p(t_i) = \sigma\,(-1)^i E_i, \qquad i = 0, 1, 2, \ldots, n+1,$

where $\sigma = \pm 1$. Then for any polynomial $q$ of degree $\le n$,

$$\|f - q\|_\infty \ge \min_{i=0,1,\ldots,n+1} E_i.$$

Proof We prove this by contradiction. Suppose $\|f - q\|_\infty < \min_{i=0,1,\ldots,n+1} E_i$. Suppose also that $E_i > 0$ for all $i$, as otherwise there is nothing to prove. Then

$$\begin{aligned}
q(t_i) - p(t_i) &= (f(t_i) - p(t_i)) - (f(t_i) - q(t_i)) \\
&= \sigma\,(-1)^i E_i - (f(t_i) - q(t_i)).
\end{aligned}$$

Since $\|f - q\|_\infty < \min_{j=0,1,\ldots,n+1} E_j$, it follows that $\operatorname{sign}(q(t_i) - p(t_i)) = \sigma\,(-1)^i$. Thus $q(t) - p(t)$ alternates in sign over the $n+2$ points $t_i$ in $[a,b]$, and so $q - p$ has at least $n+1$ roots. But $p$ and $q$ are polynomials of degree $\le n$, so $p - q$ is a polynomial of degree $\le n$. Thus $q - p \equiv 0$, which contradicts $\|f - q\|_\infty < \min_{i=0,1,\ldots,n+1} E_i$.


4.6.2 Chebyshev Polynomials and Interpolation 


Chebyshev polynomials are polynomials with an equi-oscillation property. Chebyshev polynomials are usually defined implicitly by

(4.6.3) $T_k(\cos\theta) = \cos(k\theta), \qquad k = 0, 1, 2, \ldots.$

At first it can be hard to see why these should be polynomials. We can start with some examples:

$$\begin{aligned}
T_0(\cos\theta) &= \cos(0\,\theta) = 1, \\
T_1(\cos\theta) &= \cos(1\,\theta) = \cos\theta, \\
T_2(\cos\theta) &= \cos(2\theta) = 2\cos^2\theta - 1, \\
T_3(\cos\theta) &= \cos(3\theta) = 4\cos^3\theta - 3\cos\theta.
\end{aligned}$$

That is, $T_0(x) = 1$, $T_1(x) = x$, $T_2(x) = 2x^2 - 1$, and $T_3(x) = 4x^3 - 3x$. We can prove the fact that $T_k$ is a polynomial for $k = 0, 1, 2, \ldots$ and provide a useful means of computing Chebyshev polynomials as follows:

$$\begin{aligned}
T_{k+1}(\cos\theta) &= \cos((k+1)\theta) = \cos(k\theta)\cos\theta - \sin(k\theta)\sin\theta, \\
T_{k-1}(\cos\theta) &= \cos((k-1)\theta) = \cos(k\theta)\cos\theta + \sin(k\theta)\sin\theta.
\end{aligned}$$

Adding gives

$$T_{k+1}(\cos\theta) + T_{k-1}(\cos\theta) = 2\cos(k\theta)\cos\theta = 2\,T_k(\cos\theta)\cos\theta.$$

That is,

$T_{k+1}(x) + T_{k-1}(x) = 2x\,T_k(x)$, or equivalently,

(4.6.4) $T_{k+1}(x) = 2x\,T_k(x) - T_{k-1}(x).$

Since we already know that $T_0$ and $T_1$ are polynomials, we can use induction to see that $T_k$ is a polynomial for $k = 0, 1, 2, \ldots$. Furthermore, the degree of $T_k$ is exactly $k$, and the leading coefficient of $T_k(x)$ is $\max(1, 2^{k-1})$.

Figure 4.6.1 shows the Chebyshev polynomials $T_k(x)$ for $k = 0, 1, 2, 3, 4$ for $-1 \le x \le +1$. The equi-oscillation properties of these Chebyshev polynomials are clearly visible in Figure 4.6.1.
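
The recurrence (4.6.4) translates directly into code. The following is a minimal sketch in Python (the function name `chebyshev_T` is chosen here for illustration); it evaluates $T_k(x)$ by the recurrence and compares the result with the defining identity (4.6.3).

```python
import math

def chebyshev_T(k, x):
    """Evaluate T_k(x) with the recurrence T_{j+1}(x) = 2 x T_j(x) - T_{j-1}(x)."""
    t_prev, t_curr = 1.0, x          # T_0(x) and T_1(x)
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

theta = 0.7
values = [chebyshev_T(k, math.cos(theta)) for k in range(6)]
expected = [math.cos(k * theta) for k in range(6)]   # T_k(cos theta) = cos(k theta)
```

Each evaluation costs $O(k)$ operations and needs no trigonometric functions, so it also works for $|x| > 1$ where $x = \cos\theta$ has no real solution.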
Fig. 4.6.1 Chebyshev polynomials for $x\in[-1,+1]$


The equi-oscillation property of Chebyshev polynomials gives them a valuable 
minimax property. 


Theorem 4.18 The polynomial $2^{-k}\,T_{k+1}(x) = x^{k+1} - q(x)$ where $q(x)$ minimizes

$$\max_{-1\le x\le +1}\bigl|x^{k+1} - q(x)\bigr|$$

over all polynomials $q$ of degree $\le k$.

Proof By Theorem 4.16, it is sufficient to find $k+2$ points $t_i$ where $t_i^{k+1} - q(t_i)$ oscillate in sign and $|t_i^{k+1} - q(t_i)| = \max_{-1\le x\le +1}|x^{k+1} - q(x)|$. First we show that

$$\max_{-1\le x\le +1}\bigl|x^{k+1} - q(x)\bigr| = 2^{-k}.$$

For $-1 \le x \le +1$, we can write $x = \cos\theta$ for some $\theta$; then $T_{k+1}(x) = T_{k+1}(\cos\theta) = \cos((k+1)\theta)\in[-1,+1]$, and therefore $\max_{-1\le x\le +1}|x^{k+1} - q(x)| = \max_{-1\le x\le +1} 2^{-k}\,|T_{k+1}(x)| \le 2^{-k}$. Setting $x = +1 = \cos 0$ shows that this bound is attained as $T_{k+1}(1) = 1$.

The equi-oscillation points we can choose are $t_i = -\cos(i\pi/(k+1))$ for $i = 0, 1, 2, \ldots, k+1$:

$$\begin{aligned}
t_i^{k+1} - q(t_i) = 2^{-k}\,T_{k+1}(t_i) &= 2^{-k}(-1)^{k+1}\cos((k+1)\,i\pi/(k+1)) \\
&= (-1)^{k+1+i}\,2^{-k}.
\end{aligned}$$

These are the $k+2$ equi-oscillation points that we sought, and so $q$ achieves the minimum.


Algorithm 57 Remez algorithm
1  function remez($f$, $\mathbf{t}$, $\epsilon$, $a$, $b$)
2    done ← false
3    while not done
4      solve $a_0 + a_1 t_i + a_2 t_i^2 + \cdots + a_n t_i^n + (-1)^i E = f(t_i)$, $i = 0, 1, \ldots, n+1$,
5        for $a_0, a_1, \ldots, a_n, E$
6      let $p(x) = a_0 + a_1 x + \cdots + a_n x^n$
7      $t_0, t_1, \ldots, t_{n+1}$ ← local maxima of $|f(x) - p(x)|$ on $[a,b]$
8      if $\max_i \bigl|\,|p(t_i) - f(t_i)| - |E|\,\bigr| < \epsilon$: done ← true; end if
9    end while
10 end function

If we consider the interpolation error for interpolating a function $f$ by a polynomial $p$ of degree $\le n$, by (4.1.7),

$$f(x) - p(x) = \frac{f^{(n+1)}(c_x)}{(n+1)!}\,(x - x_0)\cdots(x - x_n).$$

If we want to interpolate over the interval $[-1,+1]$, it is reasonable to focus on minimizing

$$\max_{-1\le x\le +1}\bigl|(x - x_0)\cdots(x - x_n)\bigr|$$

by choosing appropriate interpolation points. Noting that $(x - x_0)\cdots(x - x_n) = x^{n+1} - q(x)$ for some polynomial $q$ of degree $\le n$, we solve this minimization problem by setting $x^{n+1} - q(x) = 2^{-n}\,T_{n+1}(x)$. What, then, should the interpolation points be? They should be the roots of $T_{n+1}(x)$, which are

(4.6.5) $x_i = \cos\Bigl(\dfrac{(i+\frac12)\pi}{n+1}\Bigr), \qquad\text{for } i = 0, 1, 2, \ldots, n.$

For interpolation on a more general interval $[a,b]$, we use

(4.6.6) $x_i = \dfrac{a+b}{2} + \dfrac{b-a}{2}\cos\Bigl(\dfrac{(i+\frac12)\pi}{n+1}\Bigr), \qquad\text{for } i = 0, 1, 2, \ldots, n.$
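
The Chebyshev interpolation points (4.6.5) and (4.6.6) are easy to generate, and the claim that they make $\max|(x - x_0)\cdots(x - x_n)|$ equal to $2^{-n}$ on $[-1,+1]$ can be checked on a fine grid. The sketch below is an illustration only; the function names and the comparison with equally spaced points are choices made here.

```python
import math

def chebyshev_nodes(n, a=-1.0, b=1.0):
    """Return the n+1 interpolation points (4.6.6) on [a, b]."""
    mid, half = 0.5 * (a + b), 0.5 * (b - a)
    return [mid + half * math.cos((i + 0.5) * math.pi / (n + 1)) for i in range(n + 1)]

def node_polynomial(x, nodes):
    """Evaluate (x - x_0)(x - x_1)...(x - x_n)."""
    result = 1.0
    for xi in nodes:
        result *= x - xi
    return result

n = 5
grid = [-1.0 + 2.0 * j / 2000 for j in range(2001)]

cheb = chebyshev_nodes(n)
max_cheb = max(abs(node_polynomial(x, cheb)) for x in grid)

equi = [-1.0 + 2.0 * i / n for i in range(n + 1)]      # equally spaced points
max_equi = max(abs(node_polynomial(x, equi)) for x in grid)
```

Here `max_cheb` should be close to $2^{-5}$, while the equally spaced points give a larger maximum, in line with Theorem 4.18.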


4.6.3 Remez Algorithm 


While using Chebyshev interpolation points (4.6.6) often gives us a near-minimax approximation, how do we obtain an actual minimax approximation? Evgenii Remez gave an algorithm in [212] (1934) for computing minimax polynomial approximations based on the Chebyshev equi-oscillation theorem (Theorem 4.16). See Algorithm 57. This is actually Remez' second algorithm for minimax approximation. Difficulties in implementing this algorithm come from trying to find $n+2$ local maximizers of $|f(x) - p(x)|$. Since the polynomial $p(x)$ satisfies $f(t_i) - p(t_i) = (-1)^i E$ for all $i$, we have oscillation of signs of $f(x) - p(x)$, and so we can find $n+2$ local maximizers of $|f(x) - p(x)|$, with at least one between zeros of $f(x) - p(x)$ adjacent to each $t_i$, or between a zero of $f(x) - p(x)$ and an endpoint of the interval $[a,b]$. One of the local maximizers chosen in line 7 should be the global maximizer of $|f(x) - p(x)|$.
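
Lines 4 to 6 of Algorithm 57 amount to one square linear solve. The sketch below shows only that step, not the full algorithm; the function names and the small Gaussian elimination helper are written here for illustration. By Theorem 4.17, the computed $|E|$ is a lower bound on the minimax error, since the resulting $p$ satisfies (4.6.2) with $E_i = |E|$.

```python
def solve_linear(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                        # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def remez_reference_step(f, t):
    """Given n+2 points t, solve sum_k a_k t_i^k + (-1)^i E = f(t_i) for a_0..a_n and E."""
    n = len(t) - 2
    A = [[ti ** k for k in range(n + 1)] + [(-1.0) ** i] for i, ti in enumerate(t)]
    sol = solve_linear(A, [f(ti) for ti in t])
    return sol[:-1], sol[-1]

# f(x) = x^2 on [-1, 1] with n = 1 and reference points -1, 0, 1
coeffs, E = remez_reference_step(lambda x: x * x, [-1.0, 0.0, 1.0])
```

For this small case the step already returns the minimax line $p(x) = \tfrac12$ with $E = \tfrac12$, because the chosen points happen to be the equi-oscillation points. In general, line 7 must update the points and the solve is repeated.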


4.6.4 Minimax Approximation in Higher Dimensions 


In two or more dimensions, the Chebyshev equi-oscillation theorem (Theorem 4.16) does not hold, as there is no longer an ordering of points. In this more general setting, we consider the problem of finding a polynomial $p$ of degree $\le k$ that minimizes

(4.6.7) $\|f - p\|_\infty = \max_{\mathbf{x}\in D}|f(\mathbf{x}) - p(\mathbf{x})|.$

We assume that $D$ is a closed and bounded set in $\mathbb{R}^d$.

If we follow the steps in our proof of Theorem 4.16, we note that the function $p\mapsto\varphi(p) := \|f - p\|_\infty$ is a convex function which is defined in terms of a maximum of $\psi(p,\mathbf{x},\sigma) := \sigma\,(f(\mathbf{x}) - p(\mathbf{x}))$ over $(\mathbf{x},\sigma)\in D\times\{\pm 1\}$. Then applying Theorem 4.15, we have a global minimizer at $p$ if and only if for any polynomial $q(\mathbf{x})$ of degree $\le k$ we have

$$\max_{(\mathbf{x},\sigma)\in Y^*(p)}\psi'(p,\mathbf{x},\sigma;q) \ge 0, \qquad\text{where}$$

$$Y^*(p) = \{\,(\mathbf{x},\sigma)\in D\times\{\pm 1\} \mid \sigma\,(f(\mathbf{x}) - p(\mathbf{x})) = \|f - p\|_\infty\,\}.$$

That is, $p$ is a global minimizer if for any polynomial $q(\mathbf{x})$ of degree $\le k$ we have

$$\max_{(\mathbf{x},\sigma)\in Y^*(p)} -\sigma\,q(\mathbf{x}) \ge 0.$$

We call the set $\{\,\mathbf{x}\in D \mid \text{there is } \sigma = \pm 1 \text{ where } (\mathbf{x},\sigma)\in Y^*(p)\,\}$ the equi-oscillation set for $p$ over $D$. Given $p$, each element of the equi-oscillation set is an equi-oscillation point, and we can consider each equi-oscillation point $\mathbf{x}$ to be labeled by the sign $\sigma$ where $(\mathbf{x},\sigma)\in Y^*(p)$.


Example 4.19 As an example, consider the problem of minimizing $\|f - p\|_\infty$ where $f(x,y) = x\,y$ and $p$ ranges over all linear functions on $D = [-1,+1]^2$. The solution is, in fact, $p(x,y) = 0$. The points $(x,y,\sigma)\in D\times\{\pm 1\}$ in $Y^*(p)$ are given by the four corners of $D$ with $\sigma = +1$ at $(x,y) = (+1,+1)$ and $(x,y) = (-1,-1)$ while $\sigma = -1$ at $(x,y) = (+1,-1)$ and $(x,y) = (-1,+1)$. No linear function $q(x,y)$ can be negative at $(x,y) = (+1,+1)$ and $(x,y) = (-1,-1)$ and positive at $(x,y) = (+1,-1)$ and $(x,y) = (-1,+1)$. Thus $p(x,y) = 0$ is the minimax linear approximation to $f(x,y) = x\,y$.

Another example is the minimax linear approximation to $f(x,y) = x^2 + y^2$ over $D = [-1,+1]^2$, which is $p(x,y) = 1$. The maximum error occurs at the corners of $D$ and at $(x,y) = (0,0)$. Any linear function that is positive at the corners must be positive at the center also, so $p$ must be the minimax linear approximation.
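
Both claims about linear functions in Example 4.19 follow from a short computation. As a sketch, write a general linear function as $q(x,y) = \alpha + \beta x + \gamma y$, where the coefficient names $\alpha, \beta, \gamma$ are introduced here only for illustration:

```latex
\begin{align*}
q(+1,+1) + q(-1,-1) &= 2\alpha = q(+1,-1) + q(-1,+1), \\
q(0,0) &= \alpha = \tfrac14\bigl[\,q(+1,+1) + q(+1,-1) + q(-1,+1) + q(-1,-1)\,\bigr].
\end{align*}
```

The first line shows that $q$ cannot be negative at both ends of one diagonal and positive at both ends of the other, since that would require $2\alpha < 0$ and $2\alpha > 0$ together. The second line shows that $q(0,0)$ is the average of the corner values, so it is positive whenever all four corner values are positive.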

4.6.4.1 Identifying and Finding Minimax Approximations over D 


Note that if

$$X^*(p) := \{\,\mathbf{x} \mid \text{there is } \sigma\in\{\pm 1\} \text{ where } (\mathbf{x},\sigma)\in Y^*(p)\,\}$$

is a set of interpolation points for degree $k$ interpolation on $D$, then $p$ is not a minimax approximation of degree $\le k$.

Another way of thinking about minimax approximation is that it is a kind of linear program: minimizing a linear function subject to linear equality and inequality constraints:

(4.6.8) $\min_{\mathbf{z}}\ \mathbf{c}^T\mathbf{z} \quad\text{subject to}\quad A\mathbf{z} \ge \mathbf{b},$

where the inequality "$A\mathbf{z} \ge \mathbf{b}$" is understood componentwise: $(A\mathbf{z})_i \ge b_i$ for all $i$. Note that $\sigma\,(f(\mathbf{x}) - p(\mathbf{x}))$ is a linear function of $p$ up to the constant term $\sigma f(\mathbf{x})$. Our minimax approximation problem is then

(4.6.9) $\min_{p,\,E}\ E \quad\text{subject to}\quad E - \sigma\,(f(\mathbf{x}) - p(\mathbf{x})) \ge 0 \quad\text{for all } \mathbf{x}\in D,\ \sigma\in\{\pm 1\}.$

The difficulty with (4.6.9) is that there are infinitely many constraints. We can replace $D$ in (4.6.9) with a finite subset $D_N$ for some parameter $N > 0$ representing the number of points in $D$ chosen. The question then becomes the trade-off between accuracy and efficiency: larger $N$ means $D_N$ gives better approximations, but greater computational cost. If $D_N\times\{\pm 1\}$ contains $Y^*(p^*)$ for the minimax approximation $p^*$, then the solution of (4.6.9) with $D$ replaced by $D_N$ will be $p^*$. The task then becomes one of trying to estimate $Y^*(p^*)$. If we choose $D_N$ to be an interpolation set for degree $k$ interpolation over $D$, then solving

(4.6.10) $\min_{p,\,E}\ E \quad\text{subject to}\quad E - \sigma\,(f(\mathbf{x}) - p(\mathbf{x})) \ge 0 \quad\text{for all } \mathbf{x}\in D_N,\ \sigma\in\{\pm 1\}$

is equivalent to interpolating $f(\mathbf{x}) - p(\mathbf{x}) = \sigma E$ for $\mathbf{x}\in D_N$ and appropriate choice of sign $\sigma$ for each $\mathbf{x}$.


Exercises. 


(1) Show that if $q(x)$ is the minimax approximation to $x^{n+1}$ over $-1 \le x \le +1$ then $x^{n+1} - q(x) = 2^{-n}\,T_{n+1}(x)$.

(2) Implement Remez' second algorithm (Algorithm 57). The hardest line to implement is line 7. One way of doing this is to evaluate $e(t) := f(t) - p(t)$ at $N+1$ equally spaced points $s_k = a + k(b-a)/N$ in $[a,b]$ with $N \gg n$. Local maxima and local minima of $e(t)$ can be identified by $e(s_{k-1}) < e(s_k) > e(s_{k+1})$ and $e(s_{k-1}) > e(s_k) < e(s_{k+1})$, respectively. Remember we want to alternate between local minima and local maxima. Comparing $e(s_0)$ with $e(s_1)$ can tell you which to look for first. What to do if there are not enough or too many local maxima and local minima? Try adding arbitrary points if there are not enough, and removing some if there are too many. Initialize with the Chebyshev interpolant. Test at least on $f(x) = e^x$ over $[0,1]$ and $f(x) = 1/(1+x^2)$ over $[-5,+5]$ with various $n$.

(3) It is possible to use Newton's method in some cases to find minimax polynomial approximations via Chebyshev's equi-oscillation theorem. Here $f : [a,b]\to\mathbb{R}$ is the function to be approximated, which we assume to be smooth. We wish to find a polynomial $p(x) = \sum_{k=0}^{n} a_k x^k$ with coefficients $a_0, a_1, \ldots, a_n$ to be determined as well as $E = \|f - p\|_\infty$ and the equi-oscillation points $a \le z_0 < z_1 < \cdots < z_n < z_{n+1} \le b$. Count the number of unknowns (it should be $2n+4$). The equations to be solved for are $z_0 = a$, $z_{n+1} = b$, the equi-oscillation conditions $f(z_j) - p(z_j) = \sigma\,(-1)^j E$ for $j = 0, 1, 2, \ldots, n+1$, and from the fact that the $z_j$'s are local maximizers of $|f - p|$, $f'(z_j) - p'(z_j) = 0$ for $j = 1, 2, \ldots, n$. Count the number of equations. Set up the (square) system of nonlinear equations. Compute the Jacobian matrix for this system and apply a guarded Newton's method. Test this on $f(x) = e^x$ over $[0,1]$ and $n = 3$ and $n = 4$ using the Chebyshev interpolant for $p(x)$ and $z_j = \frac12(a+b) - \frac12(b-a)\cos(j\pi/(n+1))$, $j = 0, 1, 2, \ldots, n+1$, as the starting point.

(4) Show that if $f(x,y) = x\,y$ then the zero function is the linear minimax approximation to $f$ over $[-1,+1]\times[-1,+1]$. [Hint: The points where the minimax error occurs are the vertices of the square.]

(5) Suppose that $D\subset\mathbb{R}^d$ is closed and bounded, $f : D\to\mathbb{R}$ is continuous, and we are looking for a minimax approximation $p$ to $f$ in a finite-dimensional vector space $\mathcal{P}$ of continuous functions $D\to\mathbb{R}$. Call a set $S\subset D$ an interpolation set for $\mathcal{P}$ if for any values $v : S\to\mathbb{R}$ there is a unique interpolant $q\in\mathcal{P}$ where $q(\mathbf{x}) = v(\mathbf{x})$ for all $\mathbf{x}\in S$. Show that if a candidate approximation $p\in\mathcal{P}$ has the property that the equi-oscillation set for $p$ is contained in an interpolation set for $\mathcal{P}$, then $p$ is not a minimax approximation to $f$.

(6) Suppose that $\mathcal{P}$ is the set of quadratic functions of two variables over a triangle $D$, $f : D\to\mathbb{R}$ is continuous, and the equi-oscillation set for $p\in\mathcal{P}$ over $D$ consists of the vertices of the triangle and a point in the interior of $D$. Show that $p$ is not a minimax approximation for $f$ over $D$. [Hint: Show that for any assignment of signs to the equi-oscillation points, there is a quadratic function that has the specified signs at the specified equi-oscillation points.]

(7) Devise a linear program to identify if $Y^*(p) = \{(\mathbf{x}_1,\sigma_1), (\mathbf{x}_2,\sigma_2), \ldots, (\mathbf{x}_r,\sigma_r)\}$ has the properties to imply $p\in\mathcal{P}$ is a minimax approximation to $f : D\to\mathbb{R}$ where $\mathcal{P} = \operatorname{span}\{\phi_1, \phi_2, \ldots, \phi_m\}$. [Hint: We want to know if there is a $d\in\mathcal{P}$ where $\max_{i=1,2,\ldots,r}\sigma_i\,d(\mathbf{x}_i) < 0$. Write $d = \sum_{j=1}^{m} c_j\phi_j$. Inequalities $v \ge \sigma_i\sum_{j=1}^{m} c_j\phi_j(\mathbf{x}_i)$ for all $i$ ensure that $v \ge \max_{i=1,2,\ldots,r}\sigma_i\,d(\mathbf{x}_i)$. To ensure that the problem is bounded we can add inequalities $-1 \le c_j \le +1$ for all $j$. Now minimize $v$ over all $c_j$'s and $v$ satisfying the constraints. Show that $v = \max_{i=1,2,\ldots,r}\sigma_i\,d(\mathbf{x}_i)$ at the minimum.]

(8) Chebyshev filters are often used in Electrical Engineering. These have some good properties in that they have little distortion within the desired band, but rapid dropoff outside the desired band. For example, a Chebyshev low-pass filter has the frequency response function

$$G(\omega) = \frac{1}{\sqrt{1 + \epsilon^2\,T_n(\omega/\omega_0)^2}}.$$

To implement a Chebyshev filter is to find a rational function $F(z)$ with real coefficients where $|F(i\omega)| = G(\omega)$ for all $\omega$; for a realizable filter (that is, one that can actually be built), we need all the roots of the denominator of $F(z)$ to have negative real part. Assume that $\omega_0 = 1$.

(a) Show that $|F(z)|^2 = F(z)\,\overline{F(z)} = F(z)\,F(-z)$ for $z = i\omega$ since $F$ is a rational function with real coefficients.

(b) If we write $F(z) = P(z)/Q(z)$ where $P$ and $Q$ are polynomials with real coefficients, then the roots of $Q(z)\,Q(-z)$ are zeros of $1 + \epsilon^2\,T_n(-iz)\,T_n(+iz)$.

(c) The zeros of $1 + \epsilon^2(-1)^n\,T_n(+iz)^2$ are given by $T_n(+iz) = \pm(-1)^{(n+1)/2}/\epsilon$. The definition of $T_n$ is usually made via $T_n(\cos\theta) = \cos(n\theta)$, so $T_n(\frac12(e^{i\theta} + e^{-i\theta})) = \frac12(e^{in\theta} + e^{-in\theta})$. Thus, we want to solve $\frac12(e^{in\theta} + e^{-in\theta}) = \pm(-1)^{(n+1)/2}/\epsilon$. Solve this equation for $e^{in\theta}$, and so find solutions $z$ for $1 + \epsilon^2(-1)^n\,T_n(+iz)^2 = 0$.

(d) Find the zeros of $Q(z)$ for $n = 4$. Let $Q(z) = \prod_{j=1}^{n}(z - z_j)$ where $z_j$ are the roots of $Q$.

(e) Since $P(i\omega)\,P(-i\omega)/(Q(i\omega)\,Q(-i\omega)) = G(\omega)^2$ for real $\omega$, compute $P(z)$ for $n = 4$.

(f) Compute $F(z) = P(z)/Q(z)$. Plot $|F(i\omega)|$ against $\omega$.

(9) Consider the average absolute error $\int_a^b |f(x) - p(x)|\,dx$. By differentiating this with respect to the coefficients of $p(x)$, give the equations for minimizing $\int_a^b |f(x) - p(x)|\,dx$ over all polynomials $p$ of degree $\le n$. Apply this to the problem of minimizing $\int |e^x - a|\,dx$ with respect to $a$.

(10) Padé approximation is about rational approximation of smooth functions using Taylor series. These have the form $f(x) \approx p(x)/q(x)$ where $p$ and $q$ are polynomials of the appropriate degrees with $q(0) = 1$. If we choose $\deg p = m$ and $\deg q = n$ then we can match Taylor series of $f$ up to $x^{m+n}$. Find the Padé approximation of $f(x) = e^x \approx p(x)/q(x)$ with $\deg p = \deg q = 3$. Plot this Padé approximation against $f(x)$ for $|x| \le 1$. [Hint: Write $f(x)\,q(x) = p(x)$ and expand as power series in $x$, using the fact that $q(0) = 1$. This gives a linear system of equations for the coefficients of both $p(x)$ and $q(x)$.]

(11) Show that the derivative $T_n'(\cos\theta) = n\sin(n\theta)/\sin\theta$. Also show that $U_n(\cos\theta) = \sin((n+1)\theta)/\sin\theta$ is a polynomial of degree $n$ in $\cos\theta$ by showing the recurrence $U_{n+1}(x) + U_{n-1}(x) = 2x\,U_n(x)$ with $U_0(x) = 1$ and $U_1(x) = 2x$. These polynomials $U_n$ are called Chebyshev polynomials of the second kind.


4.7 Seeking the Best—Least Squares 


If the objective function is quadratic, then setting the derivative to zero means solving a linear equation. This makes the task much easier computationally. And so we look at what that means for approximating functions using the 2-norm:

$$\|f - p\|_2 = \Bigl[\int_D (f(\mathbf{x}) - p(\mathbf{x}))^2\,d\mathbf{x}\Bigr]^{1/2}.$$

Squaring the 2-norm gives the integral of the square of $f - p$, which is a quadratic function of $p$. If we represent $p$ as a linear combination of a finite number of basis functions, then we just need to solve a finite system of linear equations. Since we understand a great deal about solving linear systems of equations, this is a straightforward way of approximating functions.


4.7.1 Solving Least Squares 


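For finite least squares problems that involve a finite sum of squares, instead of integrals, we have the normal equations (2.2.3) from Section 2.2.1: to minimize $\|A\mathbf{x} - \mathbf{b}\|_2^2$ over $\mathbf{x}$, we solve $A^T A\mathbf{x} = A^T\mathbf{b}$.

The corresponding equations for least squares approximations are given by writing $p(\mathbf{x})$ in terms of basis functions:

$$p(\mathbf{x}) = \sum_{i=1}^{N} c_i\,\phi_i(\mathbf{x}),$$

where $\{\phi_i \mid i = 1, 2, \ldots, N\}$ is a basis for our space of approximating functions $\mathcal{P}$. Let $\boldsymbol{\phi}(\mathbf{x}) := [\phi_1(\mathbf{x}), \ldots, \phi_N(\mathbf{x})]^T$ be the vector of basis functions, so $p(\mathbf{x}) = \mathbf{c}^T\boldsymbol{\phi}(\mathbf{x})$. Then

$$J(\mathbf{c}) := \int_D \bigl(f(\mathbf{x}) - \mathbf{c}^T\boldsymbol{\phi}(\mathbf{x})\bigr)^2\,d\mathbf{x} = \int_D (f(\mathbf{x}) - p(\mathbf{x}))^2\,d\mathbf{x},$$

which is to be minimized over $\mathbf{c}$. Our objective function $J$ is a convex function of $\mathbf{c}$. We can compute the gradient of $J$ by

$$\begin{aligned}
\nabla J(\mathbf{c})^T\mathbf{d} &= \frac{d}{ds}J(\mathbf{c} + s\mathbf{d})\Big|_{s=0} \\
&= \frac{d}{ds}\int_D \bigl(f(\mathbf{x}) - (\mathbf{c} + s\mathbf{d})^T\boldsymbol{\phi}(\mathbf{x})\bigr)^2\,d\mathbf{x}\,\Big|_{s=0} \\
&= 2\int_D \bigl(f(\mathbf{x}) - (\mathbf{c} + s\mathbf{d})^T\boldsymbol{\phi}(\mathbf{x})\bigr)\,\frac{d}{ds}\bigl(f(\mathbf{x}) - (\mathbf{c} + s\mathbf{d})^T\boldsymbol{\phi}(\mathbf{x})\bigr)\,d\mathbf{x}\,\Big|_{s=0} \\
&= -2\int_D \bigl(f(\mathbf{x}) - \mathbf{c}^T\boldsymbol{\phi}(\mathbf{x})\bigr)\,\mathbf{d}^T\boldsymbol{\phi}(\mathbf{x})\,d\mathbf{x} \\
&= -2\Bigl[\int_D \bigl(f(\mathbf{x}) - \mathbf{c}^T\boldsymbol{\phi}(\mathbf{x})\bigr)\,\boldsymbol{\phi}(\mathbf{x})\,d\mathbf{x}\Bigr]^T\mathbf{d}.
\end{aligned}$$

So the condition that $\nabla J(\mathbf{c}) = \mathbf{0}$ means that

(4.7.1) $\Bigl[\int_D \boldsymbol{\phi}(\mathbf{x})\,\boldsymbol{\phi}(\mathbf{x})^T\,d\mathbf{x}\Bigr]\mathbf{c} = \int_D f(\mathbf{x})\,\boldsymbol{\phi}(\mathbf{x})\,d\mathbf{x}.$

These are the normal equations for least squares approximation of functions. We should note that the matrix

(4.7.2) $B = \int_D \boldsymbol{\phi}(\mathbf{x})\,\boldsymbol{\phi}(\mathbf{x})^T\,d\mathbf{x}$

is symmetric and positive semi-definite. Symmetry is clear because the integrand is symmetric. To show that $B$ is positive semi-definite,

$$\mathbf{z}^T B\,\mathbf{z} = \int_D \mathbf{z}^T\boldsymbol{\phi}(\mathbf{x})\,\boldsymbol{\phi}(\mathbf{x})^T\mathbf{z}\,d\mathbf{x} = \int_D \bigl(\mathbf{z}^T\boldsymbol{\phi}(\mathbf{x})\bigr)^2\,d\mathbf{x} \ge 0.$$

For $B$ to be positive definite, we need $\mathbf{z} \ne \mathbf{0}$ implies that $(\mathbf{z}^T\boldsymbol{\phi}(\mathbf{x}))^2 > 0$ on a set of positive volume in $D$. This is the case if the basis functions are continuous, and $\{\phi_1, \phi_2, \ldots, \phi_N\}$ are linearly independent over $D$.

If we pre-multiply (4.7.1) by $\mathbf{c}^T$, we obtain

$$\begin{aligned}
\int_D p(\mathbf{x})^2\,d\mathbf{x} = \int_D \bigl(\mathbf{c}^T\boldsymbol{\phi}(\mathbf{x})\bigr)^2\,d\mathbf{x} &= \int_D p(\mathbf{x})\,f(\mathbf{x})\,d\mathbf{x} \\
&\le \Bigl[\int_D p(\mathbf{x})^2\,d\mathbf{x}\Bigr]^{1/2}\Bigl[\int_D f(\mathbf{x})^2\,d\mathbf{x}\Bigr]^{1/2}
\end{aligned}$$

by the Cauchy-Schwarz inequality (A.1.5); dividing by $\bigl[\int_D p(\mathbf{x})^2\,d\mathbf{x}\bigr]^{1/2}$ and squaring gives

$$\int_D p(\mathbf{x})^2\,d\mathbf{x} \le \int_D f(\mathbf{x})^2\,d\mathbf{x}.$$

In fact, $p$ is the orthogonal projection of $f$ onto the approximation space $\mathcal{P}$ with respect to the inner product $(f,g) = \int_D f(\mathbf{x})\,g(\mathbf{x})\,d\mathbf{x}$.

As with the case of the normal equations for finite sums, we can solve (4.7.1) using LU or Cholesky factorization.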
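
As a small worked instance of (4.7.1), consider approximating $f(x) = e^x$ on $D = [0,1]$ by a linear polynomial with the monomial basis $\phi_1(x) = 1$, $\phi_2(x) = x$. This example and the helper `cholesky_solve` are illustrative choices made here, not taken from the text; the integrals are evaluated by hand and the system is solved with a plain Cholesky factorization.

```python
import math

def cholesky_solve(B, g):
    """Solve B c = g for symmetric positive definite B using B = L L^T."""
    n = len(g)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = B[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    w = [0.0] * n
    for i in range(n):                       # forward substitution: L w = g
        w[i] = (g[i] - sum(L[i][k] * w[k] for k in range(i))) / L[i][i]
    c = [0.0] * n
    for i in range(n - 1, -1, -1):           # back substitution: L^T c = w
        c[i] = (w[i] - sum(L[k][i] * c[k] for k in range(i + 1, n))) / L[i][i]
    return c

# B[i][j] is the integral of x^(i+j) over [0, 1]
B = [[1.0, 1.0 / 2.0],
     [1.0 / 2.0, 1.0 / 3.0]]
# right-hand side: integrals of exp(x) and x*exp(x) over [0, 1]
g = [math.e - 1.0, 1.0]
c = cholesky_solve(B, g)

norm_p_sq = sum(c[i] * B[i][j] * c[j] for i in range(2) for j in range(2))
norm_f_sq = (math.e ** 2 - 1.0) / 2.0        # integral of exp(2x) over [0, 1]
```

Solving by hand gives $c_1 = 4e - 10$ and $c_2 = 18 - 6e$, and the computed `norm_p_sq` stays below `norm_f_sq`, as the Cauchy-Schwarz argument above predicts. For higher degrees the monomial matrix $B$ becomes badly conditioned, which is one motivation for orthogonal bases.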

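Often the symbolic evaluation of the matrix and right-hand side of (4.7.1) is not possible. In such cases, we fall back on numerical evaluation of the integrals. Care must be taken with such an approach. If we use an integration method

$$\int_D \psi(\mathbf{x})\,d\mathbf{x} \approx \sum_{j=1}^{M} w_j\,\psi(\mathbf{x}_j)$$

for approximating the integrals, then Equation (4.7.1) is replaced by

$$\Bigl[\sum_{j=1}^{M} w_j\,\boldsymbol{\phi}(\mathbf{x}_j)\,\boldsymbol{\phi}(\mathbf{x}_j)^T\Bigr]\mathbf{c} = \sum_{j=1}^{M} w_j\,\boldsymbol{\phi}(\mathbf{x}_j)\,f(\mathbf{x}_j).$$

Pre-multiplying by $\mathbf{c}^T$ gives

$$\begin{aligned}
\sum_{j=1}^{M} w_j\bigl(\mathbf{c}^T\boldsymbol{\phi}(\mathbf{x}_j)\bigr)^2 &= \sum_{j=1}^{M} w_j\,\mathbf{c}^T\boldsymbol{\phi}(\mathbf{x}_j)\,f(\mathbf{x}_j) \\
&\le \Bigl[\sum_{j=1}^{M} w_j\bigl(\mathbf{c}^T\boldsymbol{\phi}(\mathbf{x}_j)\bigr)^2\Bigr]^{1/2}\Bigl[\sum_{j=1}^{M} w_j\,f(\mathbf{x}_j)^2\Bigr]^{1/2}
\end{aligned}$$

by the Cauchy-Schwarz inequality (A.1.4) assuming that all weights $w_j > 0$. Dividing by $\bigl[\sum_{j=1}^{M} w_j(\mathbf{c}^T\boldsymbol{\phi}(\mathbf{x}_j))^2\bigr]^{1/2}$ and squaring gives

$$\sum_{j=1}^{M} w_j\bigl(\mathbf{c}^T\boldsymbol{\phi}(\mathbf{x}_j)\bigr)^2 \le \sum_{j=1}^{M} w_j\,f(\mathbf{x}_j)^2.$$

Since $\mathbf{c}^T\boldsymbol{\phi}(\mathbf{x}) = p(\mathbf{x})$, this gives $\sum_{j=1}^{M} w_j\,p(\mathbf{x}_j)^2 \le \sum_{j=1}^{M} w_j\,f(\mathbf{x}_j)^2$. Treating these sums as discrete approximations to $\int_D p(\mathbf{x})^2\,d\mathbf{x}$ and $\int_D f(\mathbf{x})^2\,d\mathbf{x}$, we can see that this gives a bound on the approximation $p(\mathbf{x})$, regardless of how well- or ill-conditioned the matrix $B$ is. However, this result depends on choosing the integration points $\mathbf{x}_j$ to be the same for both computing $\sum_{j=1}^{M} w_j\,\boldsymbol{\phi}(\mathbf{x}_j)\,\boldsymbol{\phi}(\mathbf{x}_j)^T$ and for computing the right-hand side $\sum_{j=1}^{M} w_j\,\boldsymbol{\phi}(\mathbf{x}_j)\,f(\mathbf{x}_j)$.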

A variation on this approach is to use weighted least squares, where a weighting function $w : D\to\mathbb{R}$ is used: $w(\mathbf{x}) > 0$ for all $\mathbf{x}\in D$, and $\int_D w(\mathbf{x})\,d\mathbf{x}$ is finite. Then we seek $p$ in our approximation space that minimizes

(4.7.3) $\int_D w(\mathbf{x})\,(f(\mathbf{x}) - p(\mathbf{x}))^2\,d\mathbf{x}.$

The normal equations for the weighted least squares approximation problem are then

(4.7.4) $\Bigl[\int_D w(\mathbf{x})\,\boldsymbol{\phi}(\mathbf{x})\,\boldsymbol{\phi}(\mathbf{x})^T\,d\mathbf{x}\Bigr]\mathbf{c} = \int_D f(\mathbf{x})\,w(\mathbf{x})\,\boldsymbol{\phi}(\mathbf{x})\,d\mathbf{x}.$

The matrix for this system of equations is, again, symmetric and positive semi-definite; positive definite if $\{\phi_1, \phi_2, \ldots, \phi_N\}$ are linearly independent over $D$. The same methods for solving the weighted normal equations can be used as for the standard normal equations. Numerical integration methods can also be used:

$$\int_D w(\mathbf{x})\,\psi(\mathbf{x})\,d\mathbf{x} \approx \sum_{j=1}^{M} w_j\,\psi(\mathbf{x}_j),$$

to give fully discrete methods for finding weighted least squares approximations.


4.7.2 Orthogonal Polynomials


For the weighted least squares problem of minimizing

$$\int_D w(\mathbf{x})\,(f(\mathbf{x}) - p(\mathbf{x}))^2\,d\mathbf{x}$$

over $p\in\mathcal{P}$ it can be helpful to define the weighted inner product

$$(f,g)_w = \int_D w(\mathbf{x})\,f(\mathbf{x})\,g(\mathbf{x})\,d\mathbf{x}$$

for a positive weight function $w(\mathbf{x}) > 0$ for all $\mathbf{x}\in D$ except on a set of zero volume. The weighted least squares problem can then be expressed as

$$\min_{p\in\mathcal{P}}\,(f - p, f - p)_w.$$

We consider the one-variable case:

$$(f,g)_w = \int_a^b w(x)\,f(x)\,g(x)\,dx.$$

If $\{\phi_1, \phi_2, \ldots, \phi_N\}$ is a basis for $\mathcal{P}$, then we can express the weighted normal equations (4.7.4) as

(4.7.5) $\sum_{j=1}^{N} (\phi_i,\phi_j)_w\,c_j = (\phi_i, f)_w, \qquad i = 1, 2, \ldots, N,$

for the coefficients $c_j$ where $p(x) = \sum_{j=1}^{N} c_j\,\phi_j(x)$. These equations are much easier to solve if the matrix $b_{ij} = (\phi_i,\phi_j)_w$ is a diagonal matrix, that is, if $(\phi_i,\phi_j)_w = 0$ for $i \ne j$. In that case, $c_i = (f,\phi_i)_w/(\phi_i,\phi_i)_w$.

Orthogonal basis functions are also likely to give better conditioned bases than other bases, especially if they are scaled so that $(\phi_i,\phi_i)_w = 1$ for all $i$. If the approximation space $\mathcal{P}$ is a space of polynomials, then our orthogonal basis functions are orthogonal polynomials. In fact, for polynomials of one variable, there is additional algebraic structure that we can use. We say $p_0, p_1, p_2, \ldots$ is a family of orthogonal polynomials with respect to the inner product $(\cdot,\cdot)_w$ if

• $(p_i, p_j)_w = 0$ if $i \ne j$, and
• $\deg p_j = j$ for $j = 0, 1, 2, \ldots$.

Note that scaling each $p_j$ by a non-zero scale factor $\alpha_j$ to give $\alpha_j p_j$ does not change these properties. So we can set $p_0(x) = 1$, or any other non-zero constant. Suppose we have created $p_j$ for $j = 0, 1, 2, \ldots, k$ and we want to get $p_{k+1}$. We can start with $u_{k+1}(x) = x^{k+1}$ and apply the Gram-Schmidt process (Algorithm 14):

$$q_{k+1} = u_{k+1} - \sum_{j=0}^{k}\frac{(u_{k+1}, p_j)_w}{(p_j, p_j)_w}\,p_j.$$

Since the degree of $p_j$ is $j < k+1$ for $j = 0, 1, \ldots, k$, $q_{k+1}$ is a polynomial with leading term $x^{k+1}$ and so cannot be zero. We can then set $p_{k+1} = \alpha_{k+1}\,q_{k+1}$ for some non-zero $\alpha_{k+1}$ chosen according to some other objective.

We can define the following operation on functions: $Xf(x) = x\,f(x)$. Then

(4.7.6) $(Xf, g)_w = \int_a^b w(x)\,x\,f(x)\,g(x)\,dx = (f, Xg)_w.$

We will now show that $(Xp_k, p_j)_w = 0$ for $j = 0, 1, 2, \ldots, k-2$. First, note that since $\deg p_j = j$, the polynomials $\{p_0, p_1, \ldots, p_{k-1}\}$ form a basis for all polynomials of degree $\le k-1$. Since $(p_k, p_j)_w = 0$ for all $j < k$, it follows that $(p_k, q)_w = 0$ for all polynomials $q$ with $\deg q \le k-1$. Now $(Xp_k, p_j)_w = (p_k, Xp_j)_w$ by (4.7.6), but if $j \le k-2$ then $\deg Xp_j = j+1 \le k-1$ so $(p_k, Xp_j)_w = 0$. So if we apply the Gram-Schmidt process to $Xp_k$ we get

$$\begin{aligned}
q_{k+1} &= Xp_k - \sum_{j=0}^{k}\frac{(Xp_k, p_j)_w}{(p_j, p_j)_w}\,p_j \\
&= Xp_k - \frac{(Xp_k, p_k)_w}{(p_k, p_k)_w}\,p_k - \frac{(Xp_k, p_{k-1})_w}{(p_{k-1}, p_{k-1})_w}\,p_{k-1}.
\end{aligned}$$

Setting $p_{k+1} = \alpha_k\,q_{k+1}$ now gives

(4.7.7) $p_{k+1} = \alpha_k\,q_{k+1} = \alpha_k\,Xp_k - \beta_k\,p_k - \gamma_k\,p_{k-1}$

for suitable constants $\beta_k$ and $\gamma_k$. This can be re-written as

(4.7.8) $p_{k+1}(x) = (\alpha_k x - \beta_k)\,p_k(x) - \gamma_k\,p_{k-1}(x),$

which is the general form of a three-term recurrence relation for orthogonal polynomials. An example of a three-term recurrence relation is the one for Chebyshev polynomials (4.6.4). These recurrence relations give efficient ways of computing $p_j(x)$ for $j = 0, 1, 2, \ldots$ once the values of $\alpha_j$, $\beta_j$ and $\gamma_j$ are known for $j = 0, 1, 2, \ldots$.


4.7.2.1 Legendre Polynomials


An important example of orthogonal polynomials is the orthogonal polynomials for $(f,g)_w = \int_{-1}^{+1} f(x)\,g(x)\,dx$. Up to a scale factor, these orthogonal polynomials are the Legendre polynomials, which we can write using a Rodrigues formula:

(4.7.9) $P_n(x) = \dfrac{1}{2^n\,n!}\,\dfrac{d^n}{dx^n}\bigl[(x^2 - 1)^n\bigr].$

The first few Legendre polynomials are shown in Table 4.7.1.

Table 4.7.1 Legendre polynomials

  k    $P_k(x)$
  0    $1$
  1    $x$
  2    $\frac12(3x^2 - 1)$
  3    $\frac12(5x^3 - 3x)$
  4    $\frac18(35x^4 - 30x^2 + 3)$
  5    $\frac18(63x^5 - 70x^3 + 15x)$

Some other examples of orthogonal polynomials are Chebyshev polynomials with the weight function $w(x) = 1/\sqrt{1 - x^2}$ over the interval $(-1,+1)$, Laguerre polynomials with weight function $w(x) = e^{-x}$ over the interval $[0,\infty)$, and Hermite polynomials with weight function $w(x) = e^{-x^2}$ over the interval $(-\infty,+\infty)$. Rodrigues formulas for Laguerre and Hermite polynomials are

$$\begin{aligned}
L_n(x) &= \frac{e^x}{n!}\,\frac{d^n}{dx^n}\bigl(e^{-x}x^n\bigr) \quad\text{and} \\
H_n(x) &= (-1)^n e^{x^2}\,\frac{d^n}{dx^n}\bigl(e^{-x^2}\bigr).
\end{aligned}$$
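
For the Legendre polynomials, the three-term recurrence (4.7.8) takes the well-known form $(k+1)\,P_{k+1}(x) = (2k+1)\,x\,P_k(x) - k\,P_{k-1}(x)$, that is, $\alpha_k = (2k+1)/(k+1)$, $\beta_k = 0$ and $\gamma_k = k/(k+1)$. The sketch below uses it to evaluate $P_n(x)$ and checks orthogonality with a simple composite midpoint rule; the function names and the choice of quadrature are assumptions made for this illustration.

```python
def legendre_P(n, x):
    """Evaluate P_n(x) using (k+1) P_{k+1} = (2k+1) x P_k - k P_{k-1}."""
    p_prev, p_curr = 1.0, x                  # P_0(x) and P_1(x)
    if n == 0:
        return p_prev
    for k in range(1, n):
        p_prev, p_curr = p_curr, ((2 * k + 1) * x * p_curr - k * p_prev) / (k + 1)
    return p_curr

def inner_product(m, n, panels=2000):
    """Approximate the integral of P_m(x) P_n(x) over [-1, 1] by the midpoint rule."""
    h = 2.0 / panels
    total = 0.0
    for j in range(panels):
        x = -1.0 + (j + 0.5) * h
        total += legendre_P(m, x) * legendre_P(n, x)
    return h * total

p2 = legendre_P(2, 0.5)     # Table 4.7.1 gives (3 * 0.25 - 1) / 2
p4 = legendre_P(4, 0.5)     # Table 4.7.1 gives (35/16 - 30/4 + 3) / 8
```

The values agree with the closed forms in Table 4.7.1, and `inner_product(2, 4)` is close to zero while `inner_product(3, 3)` is close to $2/7$, the known value of $\int_{-1}^{+1} P_3(x)^2\,dx$.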


4.7.3 Trigonometric Polynomials and Fourier Series


Approximation of periodic functions $f$ by trigonometric polynomials

$$f(x) \approx a_0 + \sum_{k=1}^{N}\bigl(a_k\cos(kx) + b_k\sin(kx)\bigr)$$

is a classical topic in analysis, starting with Fourier's Analytical Theory of Heat [93] (1822) with later work by Lagrange, Dirichlet, Fejér, Calderon, Zygmund, and many others. The standard approach is to minimize the error squared:

$$\min_{p_N\in\mathcal{P}_N}\int_0^{2\pi}(f(x) - p_N(x))^2\,dx,$$

where $\mathcal{P}_N = \operatorname{span}\{\,x\mapsto\cos(kx),\ x\mapsto\sin(kx) \mid k = 0, 1, 2, \ldots, N\,\}$. Since

$$\begin{aligned}
\int_0^{2\pi}\cos(kx)\cos(\ell x)\,dx &= \int_0^{2\pi}\sin(kx)\sin(\ell x)\,dx = 0 \qquad\text{for all } k \ne \ell, \\
\text{and}\quad \int_0^{2\pi}\cos(kx)\sin(\ell x)\,dx &= 0 \qquad\text{for all } k, \ell,
\end{aligned}$$

we have an orthogonal basis, from which we can easily compute the coefficients

$$\begin{aligned}
a_0 &= (2\pi)^{-1}\int_0^{2\pi} f(x)\,dx, \\
a_k &= \pi^{-1}\int_0^{2\pi} f(x)\cos(kx)\,dx, \qquad\text{for } k > 0, \text{ and} \\
b_k &= \pi^{-1}\int_0^{2\pi} f(x)\sin(kx)\,dx, \qquad\text{for } k > 0.
\end{aligned}$$

However, simply using this least squares approximation results in significant overshoot for jump discontinuities and does not guarantee convergence of $p_N$ to $f$ uniformly for continuous $f$. Fejér proved [91] that if

$$p_N(x) = a_0 + \sum_{k=1}^{N}\Bigl(1 - \frac{k}{N+1}\Bigr)\bigl(a_k\cos(kx) + b_k\sin(kx)\bigr)$$

then $p_N\to f$ uniformly as $N\to\infty$ for any continuous $f$. This approach was extended by D. Jackson [133] to give Jackson's theorem (see Section 4.5.2).
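
The difference between the plain least squares sum and Fejér's weighted sum can be seen on a function with jumps. The sketch below uses the square wave $f(x) = \operatorname{sign}(\sin x)$, an example chosen here for illustration, whose coefficients are $a_k = 0$ and $b_k = 4/(\pi k)$ for odd $k$ (zero for even $k$).

```python
import math

def square_wave_b(k):
    """Sine coefficients of f(x) = sign(sin x): 4/(pi k) for odd k, 0 for even k."""
    return 4.0 / (math.pi * k) if k % 2 == 1 else 0.0

def partial_sum(x, N):
    """Plain least squares (Fourier) approximation of order N."""
    return sum(square_wave_b(k) * math.sin(k * x) for k in range(1, N + 1))

def fejer_sum(x, N):
    """Fejer's approximation: the k-th term is damped by the factor 1 - k/(N+1)."""
    return sum((1.0 - k / (N + 1)) * square_wave_b(k) * math.sin(k * x)
               for k in range(1, N + 1))

N = 41
grid = [2.0 * math.pi * j / 4000 for j in range(4001)]
max_partial = max(partial_sum(x, N) for x in grid)
max_fejer = max(fejer_sum(x, N) for x in grid)
```

On this grid `max_partial` overshoots the value 1 noticeably near the jumps (the Gibbs phenomenon), while `max_fejer` does not exceed 1. The price is slower convergence away from the jumps.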
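Trigonometric polynomials can also be thought of in terms of complex exponentials. Since

$$e^{i\theta} = \cos\theta + i\sin\theta \qquad (i = \sqrt{-1})$$

we can also write

$$\cos\theta = \tfrac12\bigl(e^{i\theta} + e^{-i\theta}\bigr) = \operatorname{Re} e^{i\theta} \quad\text{and}\quad \sin\theta = \tfrac{1}{2i}\bigl(e^{i\theta} - e^{-i\theta}\bigr) = \operatorname{Im} e^{i\theta},$$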
where $\operatorname{Re} z$ and $\operatorname{Im} z$ are the real and imaginary parts of a complex number $z$. Thus trigonometric polynomials of order $N$ can be written in the form $p_N(x) = \sum_{k=-N}^{+N} c_k\,e^{ikx}$. Note that $e^{-ikx} = \overline{e^{ikx}}$ taking complex conjugates.


4.7.3.1 Trigonometric Interpolation and the Discrete Fourier 
Transform 


Interpolation with trigonometric polynomials on equally spaced points is a natural 
task: if x, = 2ak/N fork =0,1,2,...,M—1 and y, are given values, then we 
seek c; where 


N-1 
(4.7.10) we een” fork SOND ND, 
j=0 
Note that since j and k are integers, e?" /N—-0/N — e~?mi jk — e2mi jk, 
The transformation ¢ +> y defined by yy = )7j29 ¢; e77'/*/" is called the dis- 
crete Fourier transform (DFT). It is clearly a linear transformation CN’ — C%. And 
it is invertible: 


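$$\frac{1}{N} \sum_{k=0}^{N-1} e^{-2\pi i\, k\ell/N}\, y_k = \frac{1}{N} \sum_{k=0}^{N-1} e^{-2\pi i\, k\ell/N} \sum_{j=0}^{N-1} c_j\, e^{2\pi i\, jk/N}$$

$$= \frac{1}{N} \sum_{j=0}^{N-1} c_j \sum_{k=0}^{N-1} e^{-2\pi i\, k\ell/N}\, e^{2\pi i\, jk/N}. \tag{4.7.11}$$

But

$$\sum_{k=0}^{N-1} e^{-2\pi i\, k\ell/N}\, e^{2\pi i\, jk/N} = \sum_{k=0}^{N-1} e^{2\pi i\, (j-\ell)k/N} = \sum_{k=0}^{N-1} \left(e^{2\pi i\, (j-\ell)/N}\right)^k.$$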


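If $e^{2\pi i\, (j-\ell)/N} \neq 1$ we can use the standard formula for summing a finite geometric series to get

$$\sum_{k=0}^{N-1} \left(e^{2\pi i\, (j-\ell)/N}\right)^k = \frac{\left(e^{2\pi i\, (j-\ell)/N}\right)^N - 1}{e^{2\pi i\, (j-\ell)/N} - 1} = \frac{e^{2\pi i\, (j-\ell)} - 1}{e^{2\pi i\, (j-\ell)/N} - 1} = 0,$$

since $2\pi (j - \ell)$ is an integer multiple of $2\pi$. Thus, for $0 \le j, \ell \le N-1$, $j \neq \ell$ implies $e^{2\pi i\, (j-\ell)/N} \neq 1$ and so $\sum_{k=0}^{N-1} e^{2\pi i\, (j-\ell)k/N} = 0$. So the only non-zero term in the outer sum of (4.7.11) is the term where $j = \ell$. So

$$\frac{1}{N} \sum_{j=0}^{N-1} c_j \sum_{k=0}^{N-1} e^{-2\pi i\, k\ell/N}\, e^{2\pi i\, jk/N} = \frac{1}{N}\, c_\ell \sum_{k=0}^{N-1} 1 = \frac{1}{N}\, c_\ell\, N = c_\ell.$$

That is,

$$c_\ell = \frac{1}{N} \sum_{k=0}^{N-1} e^{-2\pi i\, k\ell/N}\, y_k, \tag{4.7.12}$$

which is the inverse discrete Fourier transform.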
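The pair (4.7.10) and (4.7.12) can be checked numerically with a direct $O(N^2)$ implementation. The following is a minimal sketch, not a library routine: the function names `dft` and `idft` are illustrative assumptions, and the sign convention is the positive exponent used in (4.7.10), which is the opposite of what many FFT libraries use for their forward transform.

```python
import cmath


def dft(c):
    """Naive O(N^2) transform y_k = sum_j c_j exp(2*pi*i*j*k/N), as in (4.7.10)."""
    n = len(c)
    w = cmath.exp(2j * cmath.pi / n)  # primitive N-th root of unity
    return [sum(c[j] * w ** (j * k) for j in range(n)) for k in range(n)]


def idft(y):
    """Naive inverse c_l = (1/N) sum_k exp(-2*pi*i*k*l/N) y_k, as in (4.7.12)."""
    n = len(y)
    w = cmath.exp(-2j * cmath.pi / n)  # conjugate root of unity
    return [sum(y[k] * w ** (k * l) for k in range(n)) / n for l in range(n)]


if __name__ == "__main__":
    c = [1.0, 2.0 - 1.0j, 0.5j, -3.0]
    recovered = idft(dft(c))
    print(max(abs(a - b) for a, b in zip(c, recovered)))  # roundoff-sized
```

Applying `idft` after `dft` should return the original coefficients up to roundoff, which is the numerical counterpart of the geometric-series argument above.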


4.7.3.2 The Fast Fourier Transform 


The obvious way of computing a discrete Fourier transform (DFT) (4.7.10) or inverse discrete Fourier transform (4.7.12) takes $O(N^2)$ floating point operations. The hidden constant is more than two since complex addition requires two real additions, while complex multiplication takes four real multiplications and two real additions.
However, it was discovered by Cooley and Tukey [56] (1965) that there is a general 
and fast way of carrying out a DFT on N data points in O(N log N) operations. 
This work was based on earlier work by Danielson and Lanczos [68] (1942), whose 
approach was pre-figured by unpublished work by C.F. Gauss. A similar method had 
been published by Yates [264] for some different transforms. 

The basic idea of the fast Fourier transform (FFT) can be seen best where N is a 
power of two, although other prime factors of N were used by Cooley and Tukey. To 
see how this works, we first consider the case where N is even and write N = 2M. 
Consider 


$$y_k = \sum_{j=0}^{N-1} c_j\, e^{2\pi i\, jk/N} = \sum_{j=0}^{2M-1} c_j\, e^{2\pi i\, jk/(2M)}.$$

We split the sum into two parts: even $j = 2\ell$ and odd $j = 2\ell + 1$. Then

$$\begin{aligned}
\sum_{j=0}^{2M-1} c_j\, e^{2\pi i\, jk/(2M)}
&= \sum_{\ell=0}^{M-1} \left( c_{2\ell}\, e^{2\pi i\, 2\ell k/(2M)} + c_{2\ell+1}\, e^{2\pi i\, (2\ell+1)k/(2M)} \right) \\
&= \sum_{\ell=0}^{M-1} c_{2\ell}\, e^{2\pi i\, \ell k/M} + e^{2\pi i\, k/(2M)} \sum_{\ell=0}^{M-1} c_{2\ell+1}\, e^{2\pi i\, \ell k/M}.
\end{aligned}$$

The first sum is the DFT of $(c_0, c_2, \ldots, c_{2(M-1)})$ while the second sum is the DFT of $(c_1, c_3, \ldots, c_{2M-1})$. We can also split the computation of $y_k$ into the cases where $0 \le k \le M-1$ and $M \le k \le 2M-1$. Note that for $0 \le k \le M-1$, $e^{2\pi i\, \ell(k+M)/M} = e^{2\pi i\, \ell k/M}$. If we write the DFT of $(c_0, c_2, \ldots, c_{2(M-1)})$ as $(u_0, u_1, \ldots, u_{M-1})$ and the DFT of $(c_1, c_3, \ldots, c_{2M-1})$ as $(v_0, v_1, \ldots, v_{M-1})$ we have for $k = 0, 1, 2, \ldots, M-1$:

$$y_k = u_k + e^{2\pi i\, k/(2M)}\, v_k, \tag{4.7.13}$$

$$y_{k+M} = u_k - e^{2\pi i\, k/(2M)}\, v_k. \tag{4.7.14}$$

Algorithm 58 Fast Fourier transform for $N = 2^m$, decimation-in-time version

    function fft(c)
        N ← length(c); M ← N/2
        a ← [c_{2ℓ} | ℓ = 0, 1, ..., M−1]; b ← [c_{2ℓ+1} | ℓ = 0, 1, ..., M−1]
        if M > 1
            u ← fft(a); v ← fft(b)
        else
            u ← a; v ← b
        end
        for k = 0, 1, ..., M−1
            y_k ← u_k + e^{2πik/(2M)} v_k
            y_{k+M} ← u_k − e^{2πik/(2M)} v_k
        end
        return y
    end function



The operations (4.7.13), (4.7.14) were called a butterfly by Tukey. This gives a 
recursive algorithm for computing FFTs as shown in Algorithm 58. 
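The recursion of Algorithm 58 translates almost line for line into a short program. The following is a hedged sketch rather than a production FFT: the names `fft` and `naive_dft` are illustrative, the input length is assumed to be a power of two, and the positive-exponent convention of (4.7.10) is used.

```python
import cmath


def fft(c):
    """Recursive decimation-in-time FFT following (4.7.13)-(4.7.14).

    Uses the positive-exponent convention of (4.7.10); len(c) must be a
    power of two.
    """
    n = len(c)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(c)
    m = n // 2
    u = fft(c[0::2])  # DFT of the even-indexed entries
    v = fft(c[1::2])  # DFT of the odd-indexed entries
    y = [0j] * n
    for k in range(m):
        t = cmath.exp(2j * cmath.pi * k / n) * v[k]  # twiddle factor times v_k
        y[k] = u[k] + t       # butterfly, (4.7.13)
        y[k + m] = u[k] - t   # butterfly, (4.7.14)
    return y


def naive_dft(c):
    """Direct O(N^2) evaluation of (4.7.10), used only as a reference."""
    n = len(c)
    return [sum(c[j] * cmath.exp(2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 0, -1, 2.5, 1j]
    err = max(abs(a - b) for a, b in zip(fft(data), naive_dft(data)))
    print(err)  # roundoff-sized
```

Each level of recursion does $O(N)$ work in the butterfly loop and there are $\log_2 N$ levels, which is where the $O(N \log N)$ operation count comes from.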

Note that Algorithm 58 is labeled as a “decimation in time” version. That is because the $c_k$'s are split into two vectors depending on whether the index $k$ is even or odd. There are versions where the split into even and odd indexes does not occur for the $c_k$'s but for the $y_k$'s. These versions are called “decimation in frequency” versions. As noted above, we can also create variants of the FFT algorithm to handle other prime factors of $N$, allowing for general FFT algorithms.

Inverse Fourier transforms can be computed by a variant of the FFT where the factors $e^{2\pi i k/(2M)}$ are replaced by the conjugates $e^{-2\pi i k/(2M)}$, and by dividing the result by $N$ at the end of the computation.


4.7.3.3 Lebesgue Numbers and Error Estimates 


The minimax approximation error obtained by trigonometric polynomials of order $\le N$ is given by the Jackson theorems to be $O(N^{-m})$ if $f$ is continuously differentiable $m$ times (see Section 4.5.2).

On the other hand, by estimating the Lebesgue numbers (see Section 4.1.2) for interpolation with trigonometric polynomials of order $\le N$ we can obtain bounds on the interpolation error. The Lagrange interpolation functions for equally spaced trigonometric polynomial interpolation on $[0, 2\pi]$ with $x_j = 2\pi j/N$, $j = 0, 1, 2, \ldots, N-1$, can be shown to be

$$L_j(x) = \frac{\sin(N(x - x_j)/2)}{N \sin((x - x_j)/2)}, \qquad j = 0, 1, 2, \ldots, N-1.$$

The Lebesgue function $\sum_{j=0}^{N-1} |L_j(x)|$ can be bounded asymptotically by $(2/\pi) \ln N$ as $N \to \infty$. Because of the slow growth of the Lebesgue numbers for equally spaced trigonometric polynomial interpolation, the interpolation error tends to be very good. Only for very rough functions do we expect the interpolation error to increase with $N$.


4.7.4 Chebyshev Expansions 


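Chebyshev polynomials are given by $T_n(\cos\theta) = \cos(n\theta)$ for $n = 0, 1, 2, \ldots$. This makes Chebyshev polynomials orthogonal with respect to the weight function $w(x) = 1/\sqrt{1 - x^2}$ over $(-1, +1)$:

$$\begin{aligned}
\int_{-1}^{+1} \frac{T_m(x)\, T_n(x)}{\sqrt{1 - x^2}}\,dx
&= \int_0^{\pi} \frac{T_m(\cos\theta)\, T_n(\cos\theta)}{\sqrt{1 - \cos^2\theta}}\, \sin\theta\,d\theta \\
&= \int_0^{\pi} \cos(m\theta)\cos(n\theta)\,d\theta
= \begin{cases} \pi, & \text{if } m = n = 0, \\ \pi/2, & \text{if } m = n \neq 0, \\ 0, & \text{otherwise.} \end{cases}
\end{aligned}$$

So for a given function $f\colon [-1, +1] \to \mathbb{R}$ we have the Fourier series expansion

$$f(\cos\theta) = \sum_{k=0}^{\infty} a_k \cos(k\theta), \quad\text{and we get}\quad f(x) = \sum_{k=0}^{\infty} a_k\, T_k(x).$$

For $f$ differentiable an arbitrary number of times, the coefficients $a_k$ go to zero asymptotically faster than any rational function. Since $|T_k(x)| \le 1$ for all $k$ and $-1 \le x \le +1$, the Chebyshev expansion $f(x) = \sum_{k=0}^{\infty} a_k\, T_k(x)$ converges uniformly and usually rapidly.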

Instead of using integrals to compute the coefficients $a_k$, we can use trigonometric interpolation on equally spaced points. This is equivalent to using the discrete Fourier transform to compute estimates of Fourier series coefficients. Applied to the function $f(x) = 1/(1 + (5x)^2)$ on $[-1, +1]$ (see the Runge phenomenon in Section 4.1.1.13), we obtain Chebyshev coefficients. These coefficients decay exponentially in $k$, as we can see in Figure 4.7.1.

Fig. 4.7.1 Chebyshev expansion coefficients for $f(x) = 1/(1 + (5x)^2)$ (even coefficients only, odd coefficients are zero); $|a_k|$ on a logarithmic scale against $k$
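The idea of estimating Chebyshev coefficients from equally spaced samples in $\theta$ can be sketched directly. The code below is an illustration under stated assumptions: the function name `chebyshev_coefficients` and the default of 256 samples are choices made here, and plain cosine sums are used in place of an FFT to keep the example short. It samples $g(\theta) = f(\cos\theta)$ at $\theta_j = 2\pi j/N$ and applies the discrete analogue of the Fourier cosine coefficient formulas.

```python
import math


def chebyshev_coefficients(f, n_coeffs, n_samples=256):
    """Estimate a_0, ..., a_{n_coeffs-1} in f(x) = sum_k a_k T_k(x).

    Samples g(theta) = f(cos(theta)) at n_samples equally spaced angles in
    [0, 2*pi) and uses discrete cosine sums. Aliasing is negligible only
    when n_coeffs is well below n_samples / 2 and the a_k decay quickly.
    """
    thetas = [2.0 * math.pi * j / n_samples for j in range(n_samples)]
    g = [f(math.cos(t)) for t in thetas]
    coeffs = []
    for k in range(n_coeffs):
        s = sum(gj * math.cos(k * t) for gj, t in zip(g, thetas))
        # a_0 carries a factor 1/N, the other coefficients a factor 2/N
        coeffs.append(s / n_samples if k == 0 else 2.0 * s / n_samples)
    return coeffs


if __name__ == "__main__":
    runge = lambda x: 1.0 / (1.0 + (5.0 * x) ** 2)
    a = chebyshev_coefficients(runge, 41)
    for k in range(0, 41, 8):
        print(k, abs(a[k]))  # even coefficients shrink as k grows
```

For $f(x) = x^2 = \tfrac12 T_0(x) + \tfrac12 T_2(x)$ the routine should return $a_0 = a_2 = 1/2$ with all other coefficients at roundoff level, which makes a convenient sanity check.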


Exercises. 


(1) Show that the three-term recurrence (4.7.8) for a family of orthogonal polynomials can be re-written as

$$x\, p_k(x) = c_k\, p_{k-1}(x) + a_k\, p_k(x) + b_k\, p_{k+1}(x). \tag{4.7.15}$$

Further, show that if the orthogonal polynomials are normalized (that is, $(p_j, p_j)_w = 1$ for all $j$), then $c_k = b_{k-1}$.

(2) Show that if the three-term recurrence (4.7.15) holds with $c_k = b_{k-1}$, then the eigenvalues of

$$T_n = \begin{bmatrix} a_0 & b_0 & & & \\ b_0 & a_1 & b_1 & & \\ & b_1 & a_2 & \ddots & \\ & & \ddots & \ddots & b_{n-1} \\ & & & b_{n-1} & a_n \end{bmatrix}$$

are the zeros of $p_{n+1}$.

(3) Use the Rodriguez formula (4.7.9) for the Legendre polynomials to show that they form a family of orthogonal polynomials with respect to the inner product $(f, g) = \int_{-1}^{+1} f(x)\, g(x)\,dx$. [Hint: Use integration by parts, many times!]

(4) Show that the coefficients $a_k$ of a Chebyshev expansion $f(x) = \sum_{k=0}^{\infty} a_k\, T_k(x)$ are given by $a_k = (1/\pi) \int_0^{2\pi} f(\cos\theta)\cos(k\theta)\,d\theta$ for $k = 1, 2, \ldots$ and $a_0 = (1/(2\pi)) \int_0^{2\pi} f(\cos\theta)\,d\theta$.




(5) If $f(x) = \sum_{k=0}^{\infty} a_k\, T_k(x) \approx \sum_{k=0}^{N} \hat{a}_k\, T_k(x)$ show that we can estimate the coefficients $a_k$ by $\hat{a}_k$ through the equations $f(\cos(j\pi/N)) = \sum_{k=0}^{N} \hat{a}_k \cos(kj\pi/N)$. Show how we can compute the $\hat{a}_k$ through a discrete Fourier transform of the data $f(\cos(j\pi/N))$, $j = 0, 1, 2, \ldots, 2N-1$.

(6) Implement a function to evaluate $\sum_{k=0}^{N} a_k\, T_k(x)$ that takes $O(N)$ floating point operations. [Hint: Use the Chebyshev three-term recurrence relation.]

(7) Plot the size of the coefficients $|a_k|$ of the Chebyshev expansion $f(x) = \sum_{k=0}^{\infty} a_k\, T_k(x)$ for $-1 \le x \le +1$ where $f(x) = e^x$. Use a logarithmic scale for $|a_k|$. Show that empirically we have $|a_k| \sim C\, r^k$ for $k$ large. Estimate the value of $r$ from the empirical data.

(8) Repeat the previous Exercise with $f(x) = 1/(1 + (5x)^2)$.

(9) Show that if $m$ divides $N$ then a DFT of order $N$ can be created by combining $m$ DFTs of order $N/m$, just as the Cooley–Tukey algorithm creates a DFT of order $N = 2^n$ by combining two DFTs of order $N/2 = 2^{n-1}$.

(10) Compute orthogonal polynomials of degree < 4 for the inner product 


1 
(f.g)=- [ nOnonors 


(11) Let $\mathrm{DFT}(f)_k = \sum_{j=0}^{N-1} e^{2\pi i\, jk/N} f_j$ be the discrete Fourier transform applied to $f$. Show that $\mathrm{DFT}(f) \circ \mathrm{DFT}(g) = \mathrm{DFT}(f * g)$ where “$\circ$” is the Hadamard or componentwise product ($(u \circ v)_k = u_k v_k$) and “$*$” is the cyclic discrete convolution ($(f * g)_k = \sum_{j=0}^{N-1} f_j\, g_{(k-j) \bmod N}$).


Project 


Using a linear program solver, create software to compute the minimax polynomial approximation $p(x, y)$ of degree $\le m$ for a given function $f(x, y)$ over the standard triangle $K = \{\, (x, y) \mid 0 \le x,\ 0 \le y,\ x + y \le 1 \,\}$. In order to approximate the local maxima of $|f(x, y) - p(x, y)|$ over $(x, y) \in K$, evaluate $|f(x, y) - p(x, y)|$
at many points in K. Test this method by applying it to f(x, y) = e** cos(x — 
5¥) (1+ y?) and m = 1, 2, 3, 4, 5. 

Avoid adding constraints for each of the many points used for evaluating $|f(x, y) - p(x, y)|$. This can greatly add to the cost of solving the linear program. Instead, add constraints

$$\pm\left(f(x, y) - p(x, y)\right) \le s$$

for points $(x, y)$ only when they are apparently relevant. Write $p(x, y) = \sum_{j=0}^{m-1} c_j\, \phi_j(x, y)$ with basis functions $\phi_j$, $j = 0, 1, 2, \ldots, m-1$, making the $c_j$'s the main unknowns in the linear program. The algorithm should maintain a set $S = \{\, (x_k, y_k) \mid k = 1, 2, \ldots, n \,\} \subset K$. At each iteration, the method solves the linear program

$$\begin{aligned}
\min_{s,\,c}\ s \quad \text{subject to} \quad
&+\Big(f(x_k, y_k) - \sum_{j=0}^{m-1} c_j\, \phi_j(x_k, y_k)\Big) \le s, \qquad k = 1, 2, \ldots, n, \\
&-\Big(f(x_k, y_k) - \sum_{j=0}^{m-1} c_j\, \phi_j(x_k, y_k)\Big) \le s, \qquad k = 1, 2, \ldots, n.
\end{aligned}$$

Initialize $S$ to be an interpolation set and add points to $S$ according to the values $|f(x, y) - p(x, y)|$ over $(x, y) \in K$.


Chapter 5
Integration and Differentiation


5.1 Integration via Interpolation 


For functions of one, two, or three variables, numerical integration is often done via 
interpolation. Error estimates for interpolation can be used to obtain error estimates 
for integration. In high dimensions, these approaches lose value as the amount of data 
needed to obtain a reasonable interpolant becomes exorbitant. But in low dimensions, 
and especially in dimension one, these approaches work very well. 


5.1.1 Rectangle, Trapezoidal and Simpson’s Rules 


Geometric intuition is useful in defining integrals as limits of areas of rectangles 
approximating the graph of the function to be integrated, as illustrated in Figure 5.1.1. 

The rectangle rule for estimating the integral, using the left-hand endpoint of each rectangle, is

$$\int_a^b f(x)\,dx \approx \sum_{i=0}^{n-1} f(x_i)\,(x_{i+1} - x_i). \tag{5.1.1}$$

The trapezoidal rule is

$$\int_a^b f(x)\,dx \approx \sum_{i=0}^{n-1} \frac{1}{2}\left[f(x_i) + f(x_{i+1})\right](x_{i+1} - x_i). \tag{5.1.2}$$

If, instead of evaluating the function at the left endpoint of each rectangle, we evaluate it at the mid-point of each interval, we have the mid-point rule:


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 325 
D. E. Stewart, Numerical Analysis: A Graduate Course, CMS/CAIMS Books 
in Mathematics 4, https://doi.org/10.1007/978-3-031-08121-7_5 




Fig. 5.1.1 Approximate integration via rectangles and trapezoids; rectangles in solid lines, trapezoids in dot-dashed lines


Algorithm 59 Rectangle rule for integration

    function rectangle(f, a, b, n)
        s ← 0
        for i = 0, 1, 2, ..., n−1
            s ← s + f(x_i)        // x_i = a + i (b−a)/n
        end for
        return s·(b−a)/n
    end function


Algorithm 60 Trapezoidal rule for integration

    function trapezoidal(f, a, b, n)
        s ← (f(a) + f(b))/2; h ← (b−a)/n
        for i = 1, 2, ..., n−1
            s ← s + f(a + i h)
        end for
        return s·(b−a)/n
    end function


Algorithm 61 Mid-point rule for integration

    function midpoint(f, a, b, n)
        s ← 0; h ← (b−a)/n
        for i = 0, 1, 2, ..., n−1
            s ← s + f(a + (i + 1/2) h)
        end for
        return s·(b−a)/n
    end function
Fig. 5.1.2 Comparison of basic methods for integrals (rectangle, trapezoidal, mid-point, and Simpson's rules); error against $n$ on a log–log plot

$$\int_a^b f(x)\,dx \approx \sum_{i=0}^{n-1} f\!\left(\frac{x_i + x_{i+1}}{2}\right)(x_{i+1} - x_i). \tag{5.1.3}$$
a =n 2 2 


Pseudo-code for these methods is shown in Algorithms 59, 60, and 61. 

The error for each of these methods applied to the integral of $x \ln x$ is shown in Figure 5.1.2. From these empirical results, we can estimate the slopes of the error curves on the log–log plot to estimate the exponent $\alpha$ in $\text{error} \approx \text{const} \cdot h^{\alpha}$: for the rectangle rule, $\alpha \approx 1$, while for the trapezoidal and mid-point rules, $\alpha \approx 2$, and for Simpson's rule (to be discussed later) we obtain $\alpha \approx 4$ if we avoid the bend near the end of the graph due to roundoff error.
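The empirical order estimates can be reproduced with a few lines of code. The sketch below implements Algorithms 59, 60, and 61 and estimates $\alpha$ by comparing errors at $n$ and $2n$; the helper name `observed_order` and the test integrand $e^x$ on $[0, 1]$ are assumptions chosen here for illustration, not the integrand behind Figure 5.1.2.

```python
import math


def rectangle(f, a, b, n):
    """Left-endpoint rectangle rule with n equal pieces (Algorithm 59)."""
    h = (b - a) / n
    return h * sum(f(a + i * h) for i in range(n))


def trapezoidal(f, a, b, n):
    """Trapezoidal rule with n equal pieces (Algorithm 60)."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        s += f(a + i * h)
    return s * h


def midpoint(f, a, b, n):
    """Mid-point rule with n equal pieces (Algorithm 61)."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def observed_order(rule, f, a, b, exact, n=64):
    """Estimate alpha in error ~ const * h**alpha from the errors at n and 2n."""
    e1 = abs(rule(f, a, b, n) - exact)
    e2 = abs(rule(f, a, b, 2 * n) - exact)
    return math.log2(e1 / e2)


if __name__ == "__main__":
    exact = math.e - 1.0  # integral of exp over [0, 1]
    for rule in (rectangle, trapezoidal, midpoint):
        print(rule.__name__, observed_order(rule, math.exp, 0.0, 1.0, exact))
```

Halving $h$ should roughly halve the rectangle-rule error and quarter the trapezoidal and mid-point errors, so the printed estimates should be close to 1, 2, and 2.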


5.1.1.1 Error Analysis 


Each of the rectangle, trapezoidal, and mid-point rules can be analyzed by using 
polynomial interpolation theory. The rectangle and mid-point rules use piecewise 
constant interpolation, while the trapezoidal rule uses piecewise linear interpolation. 


Rectangle rule. The simplest of these is the rectangle rule. Using the interpolation 
error formula (4.1.7), 


$$f(x) - f(x_i) = f'(c_{x,i})\,(x - x_i) \quad\text{with } x_i \le c_{x,i} \le x,$$

we can estimate the integration error $e_i$ on one piece $[x_i, x_{i+1}]$:

$$\begin{aligned}
e_i &= \int_{x_i}^{x_{i+1}} f(x)\,dx - f(x_i)\,(x_{i+1} - x_i) \\
&= \int_{x_i}^{x_{i+1}} \left(f(x) - f(x_i)\right) dx = \int_{x_i}^{x_{i+1}} f'(c_{x,i})\,(x - x_i)\,dx \\
&= f'(c_i) \int_{x_i}^{x_{i+1}} (x - x_i)\,dx = f'(c_i)\, \frac{1}{2}\, (x_{i+1} - x_i)^2
\end{aligned}$$

for some $c_i \in [x_i, x_{i+1}]$ by the Generalized Mean Value Theorem, since $x - x_i$ does not change sign over $(x_i, x_{i+1})$. Thus the total error is


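$$e = \sum_{i=0}^{n-1} e_i = \sum_{i=0}^{n-1} f'(c_i)\, \frac{1}{2}\, (x_{i+1} - x_i)^2, \quad\text{so}$$

$$\begin{aligned}
|e| &\le \sum_{i=0}^{n-1} |f'(c_i)|\, \frac{1}{2}\, (x_{i+1} - x_i)^2
\le \frac{1}{2} \max_i \left(|f'(c_i)|\,(x_{i+1} - x_i)\right) \sum_{i=0}^{n-1} (x_{i+1} - x_i) \\
&\le \frac{1}{2} \max_{a \le c \le b} |f'(c)|\ \max_i |x_{i+1} - x_i|\ (x_n - x_0).
\end{aligned}$$

If $|x_{i+1} - x_i| \le h$ for all $i$, we get

$$|e| \le \frac{1}{2} \max_{a \le c \le b} |f'(c)|\ h\,(b - a) = O(h) \quad\text{as } h \to 0. \tag{5.1.4}$$

Most often we use equally-spaced evaluation points $x_i$: $x_i = a + ih$, with $h = (b - a)/n$. Then asymptotically,

$$e \sim \frac{1}{2} \left(\int_a^b f'(c)\,dc\right) h = \frac{1}{2}\left(f(b) - f(a)\right) h \quad\text{as } h \to 0. \tag{5.1.5}$$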


In either case, we write e = O(h), and say that the rectangle method is first order, 
confirming our empirical results from Figure 5.1.2. 


Trapezoidal rule. For the trapezoidal rule, we are using piecewise linear interpolation, and so for $x_i \le x \le x_{i+1}$ we have the linear interpolant $p_{1,i}(x)$ with $p_{1,i}(x_i) = f(x_i)$ and $p_{1,i}(x_{i+1}) = f(x_{i+1})$. From (4.1.7),

$$f(x) - p_{1,i}(x) = \frac{f''(c_{x,i})}{2!}\,(x - x_i)(x - x_{i+1}) \quad\text{for some } x_i \le c_{x,i} \le x_{i+1}.$$

The error for the piece $[x_i, x_{i+1}]$ is

$$\begin{aligned}
e_i &= \int_{x_i}^{x_{i+1}} f(x)\,dx - \int_{x_i}^{x_{i+1}} p_{1,i}(x)\,dx = \int_{x_i}^{x_{i+1}} \left(f(x) - p_{1,i}(x)\right) dx \\
&= \int_{x_i}^{x_{i+1}} \frac{f''(c_{x,i})}{2!}\,(x - x_i)(x - x_{i+1})\,dx.
\end{aligned}$$

Again $(x - x_i)(x - x_{i+1})$ does not change sign on the interval so by the Generalized Mean Value Theorem,

$$e_i = \frac{f''(c_i)}{2!} \int_{x_i}^{x_{i+1}} (x - x_i)(x - x_{i+1})\,dx \quad\text{for some } c_i \in [x_i, x_{i+1}].$$

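To evaluate $\int_{x_i}^{x_{i+1}} (x - x_i)(x - x_{i+1})\,dx$, we can use a change of variable: $x = x_i + s\,h_i$ with $h_i = x_{i+1} - x_i$ and $0 \le s \le 1$. With this change of variable,

$$\int_{x_i}^{x_{i+1}} (x - x_i)(x - x_{i+1})\,dx = \int_0^1 (s\,h_i)\,((s - 1)\,h_i)\; h_i\,ds = h_i^3 \int_0^1 s(s - 1)\,ds = -\frac{1}{6}\, h_i^3.$$

Thus

$$e_i = -\frac{1}{12}\, f''(c_i)\, h_i^3.$$

Summing the errors for each piece gives the total error:

$$e = \sum_{i=0}^{n-1} e_i = \sum_{i=0}^{n-1} -\frac{1}{12}\, f''(c_i)\, h_i^3 = -\frac{1}{12} \sum_{i=0}^{n-1} f''(c_i)\, h_i^3.$$

This enables us to obtain a bound on the total error: with $h = \max_i h_i$,

$$|e| \le \frac{1}{12} \sum_{i=0}^{n-1} |f''(c_i)|\, h_i^3 \le \frac{1}{12} \max_i |f''(c_i)|\ h^2 \sum_{i=0}^{n-1} h_i \le \frac{1}{12} \max_{a \le c \le b} |f''(c)|\ h^2\,(b - a). \tag{5.1.6}$$

For equally spaced evaluation points ($h_i = h$ for all $i$), we have the asymptotic estimate:

$$e \sim -\frac{1}{12} \int_a^b f''(c)\,dc\ h^2 = \frac{1}{12}\left(f'(a) - f'(b)\right) h^2 \quad\text{as } h \to 0. \tag{5.1.7}$$

That is, $e = O(h^2)$ as $h \to 0$.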
Mid-point rule. A quick short-hand for thinking about this is that degree $m$ interpolation with equal spacing $h$ gives an error of $O(h^{m+1})$; this error in the interpolant, integrated across an interval of length $b - a$, gives an error in the integral of $O(h^{m+1})$. This certainly works for the rectangle rule ($m = 0$) and the trapezoidal rule ($m = 1$). But the mid-point rule uses constant interpolation and yet appears to give an error of $O(h^2)$. Why?

Using the standard formulas with $p_{0,i}(x) = f((x_i + x_{i+1})/2)$, the constant interpolant on $[x_i, x_{i+1}]$, we get

$$\begin{aligned}
e_i &= \int_{x_i}^{x_{i+1}} f(x)\,dx - \int_{x_i}^{x_{i+1}} p_{0,i}(x)\,dx
= \int_{x_i}^{x_{i+1}} \left[f(x) - f\!\left(\frac{x_i + x_{i+1}}{2}\right)\right] dx \\
&= \int_{x_i}^{x_{i+1}} f'(c_{x,i}) \left(x - \frac{x_i + x_{i+1}}{2}\right) dx.
\end{aligned}$$

But we cannot apply the Generalized Mean Value Theorem as $x - (x_i + x_{i+1})/2$ does change sign on $[x_i, x_{i+1}]$. Even more importantly, $\int_{x_i}^{x_{i+1}} \left(x - (x_i + x_{i+1})/2\right) dx = 0$, so


X41 Xi+1 a c 
Meee ans fp ye - ae 
, . Xi+1 : 
= poets | (x — sax — 0. 


Instead of getting errors $e \sim \text{const}\cdot h$ as expected using piecewise constant interpolants, we get $e \sim \text{const}\cdot h^2$ as we can see empirically from Figure 5.1.2. The phenomenon of getting asymptotically better convergence rates than we expect from the derivation is called super-convergence. The mid-point rule on a single piece $[x_i, x_{i+1}]$ is not only exact for constant functions, as we would expect by using a constant interpolant, but is also exact for linear functions because $f'(x)$ constant implies $e_i = 0$.

Since the mid-point rule is exact for linear functions, we can consider linear interpolants $p_{1,i}(x)$ that interpolate $f$ at $(x_i + x_{i+1})/2$ and one other point in the interval. We can also use the Hermite interpolant (4.1.17) $h_{1,i}$ that interpolates $f$ at $(x_i + x_{i+1})/2$ and $f'$ at the same point. With the Hermite interpolant being exactly integrated, the error for the piece $[x_i, x_{i+1}]$ is

$$e_i = \int_{x_i}^{x_{i+1}} \left(f(x) - h_{1,i}(x)\right) dx = \int_{x_i}^{x_{i+1}} \frac{f''(c_{x,i})}{2!} \left(x - \frac{x_i + x_{i+1}}{2}\right)^2 dx.$$

Now we can apply the Generalized Mean Value Theorem to get

$$e_i = \frac{1}{2}\, f''(c_i) \int_{x_i}^{x_{i+1}} \left(x - \frac{x_i + x_{i+1}}{2}\right)^2 dx \quad\text{for some } c_i \in [x_i, x_{i+1}].$$

The integral $\int_{x_i}^{x_{i+1}} \left(x - (x_i + x_{i+1})/2\right)^2 dx$ can be easily evaluated using a change of variables $x = (x_i + x_{i+1})/2 + s\,h_i$ with $-\tfrac12 \le s \le +\tfrac12$:

$$\int_{x_i}^{x_{i+1}} \left(x - \frac{x_i + x_{i+1}}{2}\right)^2 dx = \int_{-1/2}^{+1/2} (s\,h_i)^2\, h_i\,ds = \frac{1}{12}\, h_i^3.$$

Then

$$e = \sum_{i=0}^{n-1} e_i = \sum_{i=0}^{n-1} \frac{1}{24}\, f''(c_i)\, h_i^3, \quad\text{so}$$

$$|e| \le \frac{1}{24} \max_i |f''(c_i)|\ \max_i h_i^2 \sum_{i=0}^{n-1} h_i = \frac{1}{24} \max_i |f''(c_i)|\ \max_i h_i^2\ (b - a). \tag{5.1.8}$$

Asymptotically, for equally spaced evaluation points ($h_i = h$ for all $i$),

$$e \sim \frac{1}{24} \left(\int_a^b f''(c)\,dc\right) h^2 = \frac{1}{24}\left(f'(b) - f'(a)\right) h^2 \quad\text{as } h \to 0. \tag{5.1.9}$$


5.1.1.2 Simpson’s Rule 


Simpson's rule starts by integrating the piecewise quadratic interpolation of the function $f(x)$. Because quadratic interpolation requires three data points, we use interpolation on pieces $[x_{2i}, x_{2i+2}]$. The interpolation points are $x_0, x_1, x_2, \ldots, x_n$. We need $n$ to be even. We assume equal spacing $h$ between interpolation points. On the piece $[x_{2i}, x_{2i+2}]$, the quadratic interpolant is

$$p_{2,i}(x) = f(x_{2i})\, L_{0,i}(x) + f(x_{2i+1})\, L_{1,i}(x) + f(x_{2i+2})\, L_{2,i}(x)$$

using the appropriate Lagrange interpolation functions (4.1.3). The Lagrange interpolation functions satisfy $L_{j,i}(x_{2i+\ell}) = 1$ if $j = \ell$ and zero if $j \neq \ell$ for $\ell = 0, 1, 2$. We can write $L_{j,i}(x) = L_j((x - x_{2i})/h)$ where $L_j(\ell) = 1$ if $j = \ell$ and zero if $j \neq \ell$ for $j, \ell = 0, 1, 2$. These functions are easier to compute: $L_0(s) = (s - 1)(s - 2)/2$, $L_1(s) = s(2 - s)$, and $L_2(s) = s(s - 1)/2$. Then

$$\begin{aligned}
\int_{x_{2i}}^{x_{2i+2}} f(x)\,dx \approx \int_{x_{2i}}^{x_{2i+2}} p_{2,i}(x)\,dx
&= \int_{x_{2i}}^{x_{2i+2}} \sum_{j=0}^{2} f(x_{2i+j})\, L_{j,i}(x)\,dx
= \sum_{j=0}^{2} f(x_{2i+j}) \int_{x_{2i}}^{x_{2i+2}} L_{j,i}(x)\,dx \\
&= \sum_{j=0}^{2} f(x_{2i+j})\; h \int_0^2 L_j(s)\,ds
= h \left[\frac{1}{3} f(x_{2i}) + \frac{4}{3} f(x_{2i+1}) + \frac{1}{3} f(x_{2i+2})\right].
\end{aligned}$$

Thus

$$\begin{aligned}
\int_a^b f(x)\,dx &\approx \sum_{i=0}^{(n/2)-1} \int_{x_{2i}}^{x_{2i+2}} p_{2,i}(x)\,dx
= \sum_{i=0}^{(n/2)-1} h \left[\frac{1}{3} f(x_{2i}) + \frac{4}{3} f(x_{2i+1}) + \frac{1}{3} f(x_{2i+2})\right] \\
&= (h/3) \left[f(x_0) + 4 \sum_{i=0}^{(n/2)-1} f(x_{2i+1}) + 2 \sum_{i=1}^{(n/2)-1} f(x_{2i}) + f(x_n)\right].
\end{aligned}$$

Algorithm 62 Simpson's rule

    function simpson(f, a, b, n)   // n assumed even
        h ← (b−a)/n
        s ← f(a) + f(b)
        s ← s + 4 f(x_1)
        for i = 1, 2, ..., (n/2)−1
            s ← s + 2 f(x_{2i}) + 4 f(x_{2i+1})
        end for
        return (h/3)·s
    end function


Pseudo-code for Simpson’s rule is given in Algorithm 62. 
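The natural approach to error analysis is to estimate the error on each piece:

$$\begin{aligned}
e_i &= \int_{x_{2i}}^{x_{2i+2}} f(x)\,dx - \int_{x_{2i}}^{x_{2i+2}} p_{2,i}(x)\,dx
= \int_{x_{2i}}^{x_{2i+2}} \left(f(x) - p_{2,i}(x)\right) dx \\
&= \int_{x_{2i}}^{x_{2i+2}} \frac{f'''(c_{x,i})}{3!}\,(x - x_{2i})(x - x_{2i+1})(x - x_{2i+2})\,dx.
\end{aligned} \tag{5.1.10}$$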

Note that like the case of the mid-point rule, we cannot apply the Generalized Mean Value Theorem as the polynomial part $(x - x_{2i})(x - x_{2i+1})(x - x_{2i+2})$ changes sign on the interval $[x_{2i}, x_{2i+2}]$. Also like the mid-point rule,

$$\int_{x_{2i}}^{x_{2i+2}} (x - x_{2i})(x - x_{2i+1})(x - x_{2i+2})\,dx = 0 \tag{5.1.11}$$

because $x_{2i+1}$ is the exact midpoint of $[x_{2i}, x_{2i+2}]$. The empirical evidence from Figure 5.1.2 shows that the error is $O(h^4)$ instead of $O(h^3)$. We can find the reason as (5.1.10) shows that if $f$ is a cubic polynomial, $f'''(c_{x,i})$ is constant and so (5.1.11) gives $e_i = 0$. Let $p_{3,i}(x)$ be the cubic interpolant of $f(x_{2i})$, $f(x_{2i+1})$, $f(x_{2i+2})$ and $f'(x_{2i+1})$. As Simpson's rule is exact for cubic polynomials,

$$\int_{x_{2i}}^{x_{2i+2}} p_{3,i}(x)\,dx = \frac{h}{3}\left[f(x_{2i}) + 4 f(x_{2i+1}) + f(x_{2i+2})\right] = \int_{x_{2i}}^{x_{2i+2}} p_{2,i}(x)\,dx.$$

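Then the error for $[x_{2i}, x_{2i+2}]$ is

$$\begin{aligned}
e_i &= \int_{x_{2i}}^{x_{2i+2}} \left(f(x) - p_{3,i}(x)\right) dx
= \int_{x_{2i}}^{x_{2i+2}} \frac{f^{(4)}(c_{x,i})}{4!}\,(x - x_{2i})(x - x_{2i+1})^2 (x - x_{2i+2})\,dx \\
&= \frac{f^{(4)}(c_i)}{4!} \int_{x_{2i}}^{x_{2i+2}} (x - x_{2i})(x - x_{2i+1})^2 (x - x_{2i+2})\,dx
\quad\text{for some } c_i \in [x_{2i}, x_{2i+2}] \text{ by the Generalized Mean Value Theorem} \\
&= \frac{f^{(4)}(c_i)}{4!}\, h^5 \int_0^2 s\,(s - 1)^2 (s - 2)\,ds
= -\frac{4}{15}\, \frac{f^{(4)}(c_i)}{4!}\, h^5 = -\frac{1}{90}\, f^{(4)}(c_i)\, h^5.
\end{aligned}$$

The total error is

$$e = \sum_{i=0}^{(n/2)-1} -\frac{1}{90}\, f^{(4)}(c_i)\, h^5 = -\frac{1}{90}\, h^5 \sum_{i=0}^{(n/2)-1} f^{(4)}(c_i).$$

This gives us a bound

$$|e| \le \frac{1}{90}\, h^5\, \frac{n}{2}\, \max_i |f^{(4)}(c_i)| \le \frac{1}{180}\, h^4\,(b - a) \max_{a \le c \le b} |f^{(4)}(c)|. \tag{5.1.12}$$

Asymptotically,

$$e \sim -\frac{h^4}{180} \int_a^b f^{(4)}(c)\,dc = \frac{h^4}{180}\left(f'''(a) - f'''(b)\right). \tag{5.1.13}$$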

5.1.2 Newton—Cotes Methods 


Newton–Cotes methods are integration methods based on interpolation with equally spaced points. These generalize the trapezoidal and Simpson's rules. For interpolation polynomials of order $k$ we need $k + 1$ points, so each piece $[x_{ki}, x_{k(i+1)}]$ must have $k + 1$ interpolation points. The interpolation points are therefore $x_j = a + jh$ where $h = (b - a)/n$ and $n$ is a multiple of $k$. The polynomial interpolant on $[x_{ki}, x_{k(i+1)}]$ is

$$p_{k,i}(x) = \sum_{j=0}^{k} f(x_{ki+j})\, L_{j,i}(x)$$

where $L_{j,i}$ is the $j$th Lagrange interpolation polynomial on piece $i$: $L_{j,i}(x_{ki+\ell}) = 1$ if $j = \ell$ and zero if $j \neq \ell$. Using the assumption of equally spaced interpolation points, $L_{j,i}(x) = L_j((x - x_{ki})/h)$ where $L_j(\ell) = 1$ if $j = \ell$ and zero if $j \neq \ell$ for $j, \ell = 0, 1, 2, \ldots, k$. Again, $L_j$ is a polynomial of degree $k$. The integral

$$\begin{aligned}
\int_{x_{ki}}^{x_{k(i+1)}} f(x)\,dx \approx \int_{x_{ki}}^{x_{k(i+1)}} p_{k,i}(x)\,dx
&= \sum_{j=0}^{k} f(x_{ki+j}) \int_{x_{ki}}^{x_{k(i+1)}} L_{j,i}(x)\,dx \\
&= \sum_{j=0}^{k} f(x_{ki+j})\; h \int_0^k L_j(s)\,ds = h \sum_{j=0}^{k} w_j\, f(x_{ki+j})
\end{aligned}$$

where $w_j = \int_0^k L_j(s)\,ds$. Then

$$\int_a^b f(x)\,dx \approx \sum_{i=0}^{(n/k)-1} \int_{x_{ki}}^{x_{k(i+1)}} p_{k,i}(x)\,dx = h \sum_{i=0}^{(n/k)-1} \sum_{j=0}^{k} w_j\, f(x_{ki+j}).$$

As with Simpson's method, if $k$ is even, then we get an extra bump in the order of convergence as the method is also exact for polynomials of degree $k + 1$. The reason is the symmetry of the method: $w_j = w_{k-j}$, so

$$\sum_{j=0}^{k} w_j\, (j - (k/2))^{k+1} = 0 = \int_0^k (s - (k/2))^{k+1}\,ds;$$

since the method is already exact for all polynomials of degree $\le k$, the method is also exact for all polynomials of degree $\le k + 1$. Consequently, the error is $O(h^{k+2})$ rather than $O(h^{k+1})$ as is the case for $k$ odd.

Simpson's method is already exact for polynomials of degree $\le 3$ so its error is $O(h^4)$. To get errors that are asymptotically smaller, we need to go to $k = 4$ instead of just $k = 3$. Weights for some higher order Newton–Cotes methods are shown in Table 5.1.1.
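The weights $w_j = \int_0^k L_j(s)\,ds$ can be computed exactly in rational arithmetic, which gives a way to check tabulated values. The following is a small sketch; the function name `newton_cotes_weights` is an assumption made here, and the polynomial coefficients are stored lowest degree first.

```python
from fractions import Fraction


def newton_cotes_weights(k):
    """Exact weights w_j = integral over [0, k] of L_j(s), for nodes 0, 1, ..., k."""
    weights = []
    for j in range(k + 1):
        poly = [Fraction(1)]  # coefficients of L_j(s), lowest degree first
        for l in range(k + 1):
            if l == j:
                continue
            denom = Fraction(j - l)
            new = [Fraction(0)] * (len(poly) + 1)
            for d, coef in enumerate(poly):
                # multiply the current polynomial by (s - l) / (j - l)
                new[d] += coef * (-l) / denom
                new[d + 1] += coef / denom
            poly = new
        # integrate term by term from 0 to k
        weights.append(sum(coef * Fraction(k) ** (d + 1) / (d + 1)
                           for d, coef in enumerate(poly)))
    return weights


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(k, [str(w) for w in newton_cotes_weights(k)])
```

With this scaling the weights for each $k$ sum to $k$ (the length of the reference interval), $k = 1$ and $k = 2$ reproduce the trapezoidal and Simpson weights, and negative weights first appear at $k = 8$.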

Unlike the trapezoidal and Simpson's methods, the weights for higher order Newton–Cotes methods can be negative. This becomes an important issue for very high order Newton–Cotes methods. As noted regarding the Runge phenomenon (see Section 4.1.1.13), it is better to reduce spacing before increasing the degree of the interpolant, especially with equally-spaced interpolation points.

Table 5.1.1 Some higher order Newton–Cotes integration weights; note that $w_{k-j} = w_j$.

    k     j     w_j
    4     0     14/45
    4     1     64/45
    4     2     8/15
    6     0     41/140
    6     1     54/35
    6     2     27/140
    6     3     68/35
    8     0     3956/14175
    8     1     23552/14175
    8     2     -3712/14175
    8     3     41984/14175
    8     4     -3632/2835
    10    0     80335/299376
    10    1     132875/74844
    10    2     -80875/99792
    10    3     28375/6237
    10    4     -24125/5544
    10    5     89035/12474

A theoretical result that drives this point home is the following, which connects the integration error to the minimax approximation error.


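Theorem 5.1 If $f\colon D \to \mathbb{R}$ is continuous with $D \subset \mathbb{R}^d$, and the integration method

$$\int_D w(x)\, f(x)\,dx \approx \sum_{i=1}^{N} w_i\, f(x_i)$$

is exact for all polynomials of degree $\le k$, then the error in the integral is bounded above by

$$\left(\int_D |w(x)|\,dx + \sum_{i=1}^{N} |w_i|\right) \min_{q \in \mathcal{P}_k} \|f - q\|_\infty.$$

Proof The integration error is

$$\int_D w(x)\, f(x)\,dx - \sum_{i=1}^{N} w_i\, f(x_i),$$

and for any $q \in \mathcal{P}_k$,

$$\begin{aligned}
\left|\int_D w(x)\, f(x)\,dx - \sum_{i=1}^{N} w_i\, f(x_i)\right|
&= \left|\int_D w(x)\left(f(x) - q(x)\right) dx - \sum_{i=1}^{N} w_i \left(f(x_i) - q(x_i)\right)\right| \\
&\le \int_D |w(x)|\, |f(x) - q(x)|\,dx + \sum_{i=1}^{N} |w_i|\, |f(x_i) - q(x_i)| \\
&\le \int_D |w(x)|\,dx\ \|f - q\|_\infty + \sum_{i=1}^{N} |w_i|\ \|f - q\|_\infty \\
&= \left(\int_D |w(x)|\,dx + \sum_{i=1}^{N} |w_i|\right) \|f - q\|_\infty.
\end{aligned}$$

Taking the minimum over all $q \in \mathcal{P}_k$ gives the desired result.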
Note that if our integration method is exact for constant functions (which is necessary for any order of convergence), we have

$$\int_D w(x)\,dx = \sum_{i=1}^{N} w_i.$$

So if $w(x) \ge 0$ for all $x \in D$ and $w_i \ge 0$, we have the integration error bounded by

$$2 \int_D w(x)\,dx\ \min_{q \in \mathcal{P}_k} \|f - q\|_\infty.$$

Thus, we can bound the integration error in terms of the minimax approximation error.

However, Newton–Cotes methods of higher order have negative weights, as can be seen in Table 5.1.1. In fact, if $w_j^{(k)}$ are the weights of the $k$th order Newton–Cotes method, then [206]

$$\sum_{j=0}^{k} \left|w_j^{(k)}\right| \to \infty \quad\text{as } k \to \infty.$$

Indeed, $\sum_{j=0}^{k} |w_j^{(k)}|$ grows exponentially in $k$ as $k \to \infty$ [206, p. 279]. This makes them unsuitable for high order integration methods. However, the trapezoidal and Simpson's methods from the Newton–Cotes family do have positive weights, and in fact are very robust methods.


5.1.3 Product Integration Methods 


In dealing with singularities, such as $\int_0^b x^{-\alpha} f(x)\,dx$ or $\int_0^b \ln x\, f(x)\,dx$, we need new methods. A simple approach we can use is to apply polynomial interpolation for $f(x)$, and determine the exact value of the integrals of the interpolating polynomials. For evaluation points $x_0, x_1, \ldots, x_n$, we write $f(x) \approx p_n(x) := \sum_{i=0}^{n} f(x_i)\, L_i(x)$ and we use

$$\int_a^b w(x)\, f(x)\,dx \approx \int_a^b w(x)\, p_n(x)\,dx = \sum_{i=0}^{n} f(x_i) \int_a^b w(x)\, L_i(x)\,dx.$$

The error in these approximations is bounded by

$$\int_a^b |w(x)|\,dx\ \|f - p_n\|_\infty.$$

We can use whatever interpolation scheme is appropriate, including equally-spaced interpolation points and Chebyshev interpolation points.

Product integration methods should not be restricted to cases of singularities. Product integration methods are also appropriate for smooth integrands, such as

$$\int_a^b \exp\!\left(-\beta (x - \hat{x})^2\right) f(x)\,dx \quad\text{with } \beta \gg |b - a|^{-2}.$$

A large value of $\beta$ means that to obtain reasonable accuracy using ordinary integration methods requires a spacing of interpolation points $h \ll \beta^{-1/2}$. However, we can accurately compute these integrals with $h \gg \beta^{-1/2}$ if we use

$$\int_a^b \exp\!\left(-\beta (x - \hat{x})^2\right) f(x)\,dx \approx \int_a^b \exp\!\left(-\beta (x - \hat{x})^2\right) p_n(x)\,dx,$$

and using exact, or near-exact, methods for computing the integral on the right. As was noted in Section 6.1.4, it is more important to reduce spacing than it is to increase the degree of polynomial interpolants. In that sense, when dealing with a singularity at $x = a$ we split the integral as

$$\int_a^b w(x)\, f(x)\,dx = \sum_{j=0}^{M-1} \int_{z_j}^{z_{j+1}} w(x)\, f(x)\,dx$$

with $a = z_0 < z_1 < \cdots < z_M = b$. Product integration methods can definitely be applied to the integral $\int_{z_0}^{z_1} w(x)\, f(x)\,dx$. However, the interval $[z_1, z_2]$ is often not far enough from the singularity for standard integration methods to work well. Instead, we should continue to use product integration methods. For example, for estimating

$$\int_0^b x^{-\alpha} f(x)\,dx = \sum_{j=0}^{M-1} \int_{z_j}^{z_{j+1}} x^{-\alpha} f(x)\,dx$$

we should compute

$$\int_u^v x^{-\alpha}\, p(x)\,dx$$

for $0 \le u < v$ and polynomials $p$ of degree $\le k$. To implement these methods, we need weights that depend on the interval:

$$w_i(u, v) = \int_u^v x^{-\alpha}\, L_i(x)\,dx, \qquad i = 0, 1, 2, \ldots, n$$

where $L_i$ is the $i$th Lagrange interpolation polynomial for interpolation points $x_i = u + \xi_i (v - u)$ in $[u, v]$. Here $\xi_i$ represent standardized interpolation points: $\xi_i = i/n$ for equally spaced points, $\xi_i = \tfrac12\left(1 - \cos((i + \tfrac12)\pi/(n + 1))\right)$ for Chebyshev interpolation points.
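As a concrete illustration of interval-dependent weights, the simplest case uses linear interpolation at the two endpoints of each piece, so that $w_0(u, v)$ and $w_1(u, v)$ follow from the moments $\int_u^v x^{m - \alpha}\,dx$. The sketch below is a minimal example under stated assumptions: $0 < \alpha < 1$, equal pieces on $[0, b]$, and the illustrative names `moment` and `product_trapezoid`; it is not a general product integration code.

```python
def moment(u, v, alpha, m):
    """Exact value of the integral of x**(m - alpha) over [u, v], for m + 1 - alpha > 0."""
    p = m + 1.0 - alpha
    return (v ** p - u ** p) / p


def product_trapezoid(f, b, alpha, panels):
    """Approximate the integral of x**(-alpha) * f(x) over [0, b], 0 < alpha < 1.

    On each piece [u, v], f is replaced by its linear interpolant at u and v
    and the weighted integral of that interpolant is computed exactly.
    """
    total = 0.0
    for j in range(panels):
        u = b * j / panels
        v = b * (j + 1) / panels
        m0 = moment(u, v, alpha, 0)
        m1 = moment(u, v, alpha, 1)
        w0 = (v * m0 - m1) / (v - u)  # integral of x**(-alpha) * (v - x)/(v - u)
        w1 = (m1 - u * m0) / (v - u)  # integral of x**(-alpha) * (x - u)/(v - u)
        total += w0 * f(u) + w1 * f(v)
    return total


if __name__ == "__main__":
    # Integral of x**(-1/2) * (1 + x) over [0, 1] is 2 + 2/3.
    print(product_trapezoid(lambda x: 1.0 + x, 1.0, 0.5, 7))
    # Integral of x**(-1/2) * x**2 over [0, 1] is 2/5.
    print(product_trapezoid(lambda x: x * x, 1.0, 0.5, 200))
```

Because the singular factor is integrated exactly, the rule is exact whenever $f$ is linear, and for smooth $f$ the error is controlled by the interpolation error of $f$ alone, in line with the bound $\int |w|\,dx\ \|f - p_n\|_\infty$.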

Gaussian quadrature methods (see Section 5.2) can be applied if we can compute 
the orthogonal polynomials for the weight function w(x) > 0. In this case, very 
high-order methods can be applied as the weights w; > 0, and so Theorem 5.1 gives 
good bounds on the error. 

Product integration methods can also be used for multidimensional integration. 
See Section 5.3. 


5.1.4 Extrapolation 


“If we know the error, we can subtract it out to get a better approximation.”

While we very rarely know exactly what the error is for our numerical integration methods, often we can get asymptotic estimates, such as (5.1.5), (5.1.7), (5.1.9), and (5.1.13). These enable us to extrapolate from computed results using a known method to obtain superior results.

We start by supposing that the error $e_n = v_n - v^*$ in some computation that depends on a parameter $n$ has a known asymptotic behavior:

$$e_n \sim C\, n^{-\alpha} \quad\text{as } n \to \infty. \tag{5.1.14}$$

Then Richardson extrapolation takes the values $v_n$ and $v_{2n}$ with the asymptotic error behavior

$$v_n - v^* \approx C\, n^{-\alpha}, \qquad v_{2n} - v^* \approx C\, (2n)^{-\alpha}.$$

Treating these approximations as exact we can solve for a new approximation for $v^*$: $v_{2n} - v^* \approx 2^{-\alpha} (v_n - v^*)$, so

$$v^* \approx v_{2n}^{(1)} := \frac{v_{2n} - 2^{-\alpha}\, v_n}{1 - 2^{-\alpha}}. \tag{5.1.15}$$

Now $v_n^{(1)} \to v^*$ as $n \to \infty$, but at a higher asymptotic order.


As an example, consider using the trapezoidal rule for estimating the integral of $x \ln x$. Let $T_n$ be the estimated value using the trapezoidal method with $n + 1$ function evaluations. From (5.1.7) we have $T_n - T^* \sim C\, n^{-2}$ where $T^*$ is the exact value of the integral, so $\alpha = 2$. Then Richardson extrapolation gives the estimates
$$T_{2n}^{(1)} = \frac{T_{2n} - \tfrac14\, T_n}{1 - \tfrac14}.$$

It turns out that $T_{2n}^{(1)}$ is the result of Simpson's method with $2n + 1$ evaluations. The error $e_n^{(1)} \sim C^{(1)} n^{-4}$ as $n \to \infty$, so we can repeat the process to get

$$T_{2n}^{(2)} = \frac{T_{2n}^{(1)} - 2^{-4}\, T_n^{(1)}}{1 - 2^{-4}}.$$

Continuing in this way we have

$$T_{2n}^{(j+1)} = \frac{T_{2n}^{(j)} - 2^{-2j-2}\, T_n^{(j)}}{1 - 2^{-2j-2}} \quad\text{for } j = 0, 1, 2, \ldots. \tag{5.1.16}$$

This repeated application of Richardson extrapolation is the Romberg method [215]. The error $e_n^{(j)} \sim C^{(j)} n^{-2j-2}$ as $n \to \infty$. This can give very fast convergence, as can be seen in Figure 5.1.3.

Fig. 5.1.3 Errors $e_n^{(j)}$ for the Romberg method applied to the integral of $x \ln x$; error against $n$ on a log–log plot
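The recurrence (5.1.16) can be organized as a triangular table whose first column holds trapezoidal values with $1, 2, 4, \ldots$ pieces. The following is a sketch under assumptions made here: the function names `trapezoid` and `romberg`, the table layout, and the test integrand $e^x$ on $[0, 1]$ are illustrative and are not the computation behind Figure 5.1.3.

```python
import math


def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n equal pieces."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        s += f(a + i * h)
    return s * h


def romberg(f, a, b, levels):
    """Return the triangular Romberg table.

    table[r][j] holds T^{(j)} computed with 2**r pieces; column j + 1 is
    obtained from column j by the extrapolation step (5.1.16).
    """
    table = []
    for r in range(levels):
        row = [trapezoid(f, a, b, 2 ** r)]
        for j in range(r):
            factor = 2.0 ** (-2 * j - 2)
            # row[j] plays the role of T_{2n}^{(j)}, table[r-1][j] of T_n^{(j)}
            row.append((row[j] - factor * table[r - 1][j]) / (1.0 - factor))
        table.append(row)
    return table


if __name__ == "__main__":
    exact = math.e - 1.0
    table = romberg(math.exp, 0.0, 1.0, 7)
    for r, row in enumerate(table):
        print(2 ** r, abs(row[0] - exact), abs(row[-1] - exact))
```

The second column of the table reproduces Simpson's rule, so for a cubic integrand the entry `table[1][1]` is already exact; each further column should gain two more powers of $n$ until roundoff takes over. A practical implementation would reuse function values between levels rather than recomputing each trapezoidal sum from scratch.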

This approach can be used for other techniques and integrals with other asymptotic behavior. Consider, for example, computing the integral of $\sqrt{x}\, e^x$. If we use the trapezoidal method we still get convergence: $e_n = T_n - T^* \to 0$ as $n \to \infty$ by Theorem 5.1. The error is shown in Figure 5.1.4. The slope of the straight part of the graph of $|e_n|$ on the log–log plot is estimated to be $-1.477$; using $\alpha = 3/2$ for applying Richardson extrapolation, we get $T_{2n}^{(1)} = (T_{2n} - 2^{-3/2}\, T_n)/(1 - 2^{-3/2})$. The errors $e_n^{(1)} = T_n^{(1)} - T^*$ go to zero faster. An empirical estimate of the slope on the log–log plot is $-1.997$; using $\alpha = 2$ for applying Richardson extrapolation, we get $T_{2n}^{(2)} = (T_{2n}^{(1)} - 2^{-2}\, T_n^{(1)})/(1 - 2^{-2})$. The slope of the errors $e_n^{(2)} = T_n^{(2)} - T^*$ was estimated to be $-2.493$, which is remarkably close to $-5/2$. Repeating Richardson extrapolation with $\alpha = 5/2$ we get $T_{2n}^{(3)} = (T_{2n}^{(2)} - 2^{-5/2}\, T_n^{(2)})/(1 - 2^{-5/2})$, a third derived sequence where the error has a slope estimated to be about $-2.96$, which is very close to $-3$. Figure 5.1.4 shows that this gives a much more rapid reduction in the error as $n$ increases.

Fig. 5.1.4 Romberg method adapted to the integral of $\sqrt{x}\, e^x$

The reason why the Romberg method works is that there is a fully developed asymptotic expansion

$$T_n - T^* \sim C_1\, n^{-2} + C_2\, n^{-4} + C_3\, n^{-6} + \cdots.$$

Proving this for smooth $f$ leads to Euler–Maclaurin summation, a way of estimating sums via integrals (or vice-versa) that uses a fully developed asymptotic expansion. Fully developed asymptotic expansions

$$f(n) \sim C_1\, g_1(n) + C_2\, g_2(n) + C_3\, g_3(n) + \cdots$$

mean that $g_{j+1}(n)/g_j(n) \to 0$ as $n \to \infty$,

$$f(n)/g_1(n) \to C_1 \quad\text{as } n \to \infty, \text{ and}$$

$$\left[f(n) - C_1\, g_1(n) - \cdots - C_j\, g_j(n)\right] / g_{j+1}(n) \to C_{j+1} \quad\text{as } n \to \infty \text{ for } j = 1, 2, \ldots.$$


The error due to the trapezoidal method can be represented via an integral:

\[ \int_0^1 f(x)\,dx = \frac{1}{2}\,[f(0) + f(1)] - \frac{1}{2} \int_0^1 f''(x)\,B_2(x)\,dx, \]

where $B_2(x) = x(1 - x)$ is the quadratic Bernoulli polynomial. Repeated applications
of integration by parts along with the definitions of Bernoulli numbers and
polynomials give

\[ \int_0^1 f(x)\,dx = \frac{1}{2}\,[f(0) + f(1)]
   - \sum_{k=1}^{j} \frac{B_{2k}}{(2k)!} \bigl[ f^{(2k-1)}(1) - f^{(2k-1)}(0) \bigr]
   - \frac{1}{(2j)!} \int_0^1 f^{(2j)}(x)\,B_{2j}(x)\,dx. \tag{5.1.17} \]


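The Bernoulli numbers $B_j$ are given by the formula

\[ \frac{t}{e^t - 1} = \sum_{j=0}^{\infty} B_j\,\frac{t^j}{j!} \qquad \text{for } |t| < 2\pi, \tag{5.1.18} \]

while the Bernoulli polynomials $B_j(x)$ are given by

\[ \frac{t\,e^{xt}}{e^t - 1} = \sum_{j=0}^{\infty} B_j(x)\,\frac{t^j}{j!} \qquad \text{for } |t| < 2\pi. \tag{5.1.19} \]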


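We let $h = (b - a)/n$ be a fixed spacing: $x_{i+1} - x_i = h$ for all $i$. Scaling and shifting
(5.1.17) to obtain $\int_{x_i}^{x_{i+1}} f(x)\,dx$, and then summing over $i$ gives

\[ \int_a^b f(x)\,dx = h \Bigl[ \frac{1}{2} \bigl( f(a) + f(b) \bigr) + \sum_{i=1}^{n-1} f(x_i) \Bigr]
   - \sum_{k=1}^{j} \frac{B_{2k}}{(2k)!}\,h^{2k} \bigl[ f^{(2k-1)}(b) - f^{(2k-1)}(a) \bigr]
   - \frac{h^{2j}}{(2j)!} \int_a^b f^{(2j)}(x)\,B_{2j}((x - a)/h)\,dx, \]

where $B_{2j}$ is understood to be extended periodically from $[0, 1]$. This gives us the fully
developed asymptotic expansion for the error in the trapezoidal method:

\[ e_n = T_n - T^* \sim \sum_{k=1}^{\infty} \Bigl( \frac{b - a}{n} \Bigr)^{2k} \frac{B_{2k}}{(2k)!}
   \bigl[ f^{(2k-1)}(b) - f^{(2k-1)}(a) \bigr]. \tag{5.1.20} \]

The infinite sum is understood as an asymptotic expansion, not as a convergent series.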
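As a small check of (5.1.20), the first two terms of the expansion can be subtracted from the trapezoidal value. The sketch below assumes the derivatives $f'$ and $f'''$ are available in closed form and uses $B_2 = 1/6$ and $B_4 = -1/30$; the function names are illustrative only.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule T_n with n equal subintervals."""
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

def corrected_trapezoid(f, df, d3f, a, b, n):
    """Return (T_n, T_n minus one term, T_n minus two terms) of (5.1.20).

    df and d3f are the first and third derivatives of f.
    """
    h = (b - a) / n
    t = trapezoid(f, a, b, n)
    term1 = (h**2 / 12.0) * (df(b) - df(a))      # B_2/2! = 1/12
    term2 = -(h**4 / 720.0) * (d3f(b) - d3f(a))  # B_4/4! = -1/720
    return t, t - term1, t - term1 - term2

if __name__ == "__main__":
    exact = math.e - 1.0   # integral of exp(x) over [0, 1]
    for value in corrected_trapezoid(math.exp, math.exp, math.exp, 0.0, 1.0, 8):
        print(abs(value - exact))
```

With only eight subintervals the error falls from roughly $h^2$ to $h^4$ to $h^6$ size as terms are removed, which is the behavior an asymptotic expansion promises for fixed number of terms and decreasing $h$.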


Exercises. 


(1) Apply the rectangle, trapezoidal, and Simpson's rules to estimate $\int_0^{10} x \cos(x^2)\,dx$
with $n + 1$ function evaluations ($n$ function evaluations in the case of the rectangle
rule) for $n = 2^k$, $k = 1, 2, \ldots, 15$. Plot the error magnitude against $n$ on a log-log
plot. Estimate the exponent $\alpha$ for the error asymptotes $|\mathrm{error}_n| \sim C\,n^{-\alpha}$ for the
three methods. Note that the exact value is $\frac{1}{2} \sin(100)$.

(2) Repeat the previous Exercise for estimating $\int_{-5}^{5} dx/(1 + x^2)$. Exact value is
$2 \tan^{-1}(5)$.

(3) Repeat the previous Exercise for estimating $\int_0^1 |x - 1/\sqrt{2}|^{1/2}\,dx$. Exact value is
$\frac{2}{3}\,2^{-3/4}\,[1 + (\sqrt{2} - 1)^{3/2}]$.

(4) Repeat the previous Exercise for estimating $\int_0^{2\pi} d\theta/(1 + \sin^2\theta)$. Exact value
is $\sqrt{2}\,\pi$.

(5) Show rapid convergence of the rectangle rule for $\int_0^{2\pi} f(x)\,dx$ for smooth periodic
$f$: $f(t + 2\pi) = f(t)$ for all $t$. [Hint: First show that the rectangle rule with
$n$ function evaluations on $[0, 2\pi]$ is exact for all trigonometric polynomials
$\sum_{k=0}^{n-1} [a_k \cos(kx) + b_k \sin(kx)]$. Now show that $f$ smooth implies that $f$ can
be accurately approximated by truncated Fourier series of order $n$ (which is a
trigonometric polynomial). This can be done using integration by parts.]

(6) The Clenshaw–Curtis integration method [51] is to compute the finite Chebyshev
expansion $f(x) \approx \sum_{k=0}^{n} a_k T_k(x)$ from interpolation of $f(x_j)$, $j = 0, 1, 2, \ldots, n$,
with $x_j = -\cos(j\pi/n)$. This can be done efficiently using the DFT. Show that
$\int_{-1}^{+1} f(x)\,dx \approx \sum_{k=0}^{n} a_k \int_0^{\pi} \cos(k\theta) \sin\theta\,d\theta$. Compute the
exact value of $\int_0^{\pi} \cos(k\theta) \sin\theta\,d\theta$.

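(7) Fejér [90] used interpolation at the points $x_j = \cos((j + \frac{1}{2})\pi/n)$,
$j = 0, 1, 2, \ldots, n - 1$, to create the integration scheme
$\int_{-1}^{+1} f(x)\,dx \approx \sum_{k=0}^{n-1} f(x_k) \int_{-1}^{+1} L_k(x)\,dx$ where $L_k$ is the Lagrange
interpolation polynomial. Show, as Fejér did in 1933, that $\int_{-1}^{+1} L_k(x)\,dx > 0$ for
$k = 0, 1, 2, \ldots, n - 1$. [Hint: Start by showing that
$L_k(\cos\theta) = \frac{1}{n} \bigl[ 1 + 2 \sum_{j=1}^{n-1} \cos(j\theta) \cos(j(k + \frac{1}{2})\pi/n) \bigr]$.]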

(8) Implement a tensor-product Simpson's method in two dimensions: if
$\int_a^b f(x)\,dx \approx (b - a) \sum_{i=0}^{n} w_i\,f(a + (b - a)\xi_i)$ then
$\int_a^b \int_c^d f(x, y)\,dy\,dx \approx (b - a)(d - c) \sum_{i=0}^{n} \sum_{j=0}^{n} w_i w_j\,f(a + (b - a)\xi_i,\, c + (d - c)\xi_j)$.
Implement this, if possible, using a one-dimensional Simpson's rule code.

(9) The asymptotic trapezoidal expansion (5.1.20) can be used to give accurate
estimates of infinite sums. Consider the problem of estimating $\sum_{k=0}^{\infty} 1/(1 + k^2)$.
Since $\int_N^{\infty} dx/(1 + x^2) = \pi/2 - \tan^{-1} N$ approximates
$\frac{1}{2}(1 + N^2)^{-1} + \sum_{k=N+1}^{\infty} 1/(1 + k^2)$ with an asymptotic error formula (5.1.20), we
can obtain asymptotically accurate estimates for
$\frac{1}{2}(1 + N^2)^{-1} + \sum_{k=N+1}^{\infty} 1/(1 + k^2)$. Added to the numerically computed value
$\sum_{k=0}^{N-1} 1/(1 + k^2) + \frac{1}{2}(1 + N^2)^{-1}$, this can give accurate estimates for
$\sum_{k=0}^{\infty} 1/(1 + k^2)$. Use $N = 10^2, 10^3, 10^4, 10^5$ and asymptotic error estimates
with one, two and three terms to obtain estimates of $\sum_{k=0}^{\infty} 1/(1 + k^2)$. Check that
these estimates appear to be consistent with each other.

(10) Develop a product integration method for computing

\[ \int_a^b (2\pi\sigma^2)^{-1/2} \exp\bigl( -(x - \mu)^2/(2\sigma^2) \bigr)\,f(x)\,dx \]

that remains accurate even for small $\sigma > 0$ and arbitrary $\mu \in \mathbb{R}$ by using a
piecewise quadratic interpolant of $f$. Note that you will need to have explicit
formulas for $\int_c^d \exp(-(x - \mu)^2/(2\sigma^2))\,x^k\,dx$ for $k = 0, 1, 2$. These formulas
can involve the error function $\mathrm{erf}(z) = (2/\sqrt{\pi}) \int_0^z e^{-s^2}\,ds$, which is available
on most modern computing systems.


5.2 Gaussian Quadrature 


When we use the “integrate the interpolant” approach to numerical integration, we 
expect the order of accuracy of the computed integral to be the order of accuracy of the 
interpolant. But, as we have seen with the mid-point and Simpson’s rules, sometimes 
we get a better order of accuracy than expected. This is super-convergence. Both the 
mid-point and Simpson’s rules give an order of accuracy one more than expected. 
Newton—Cotes methods using even degree interpolation show this superconvergent 
behavior: the method is accurate for degrees one higher than the interpolant. 

In the case of these methods, because the integration rule is symmetric under
the reflection of the interval $[a, b]$ with weights unchanged, we obtain superconvergence
with the order of accuracy one more than expected. How far can we take
this approach? If we choose $n$ interpolation points $x_i$ and weights $w_i$, when can we
guarantee that

\[ \int_a^b p(x)\,dx = \sum_{i=1}^{n} w_i\,p(x_i) \tag{5.2.1} \]

for all polynomials $p$ of degree $\le k$? In this way of thinking, we have $n$ evaluation
points $x_i$ and $n$ weights $w_i$ for the integration method, giving a total of $2n$ parameters
to define the method. To satisfy the equality in (5.2.1) for all polynomials of degree
$\le k$, we need to satisfy $k + 1$ independent equations. Balancing unknowns and equations
gives $2n = k + 1$, so we have reason to expect that with $n$ function evaluations we
could make (5.2.1) hold for all polynomials of degree $\le 2n - 1$. How to achieve that
is the subject of this section. The theory is based on orthogonal polynomials (see
Section 4.7.2).


5.2.1 Orthogonal Polynomials Reprise 


Given an inner product on functions

\[ (f, g)_w = \int_a^b w(x)\,f(x)\,g(x)\,dx, \tag{5.2.2} \]

with integrable $w(x) > 0$ for all $x \in (a, b)$, a family of polynomials $\phi_0, \phi_1, \phi_2, \ldots$
is a family of orthogonal polynomials if $(\phi_i, \phi_j)_w = 0$ for all $i \ne j$, and $\deg \phi_j = j$ for
$j = 0, 1, 2, \ldots$. Scaling each member of this family by $\psi_j = \alpha_j \phi_j$ with $\alpha_j \ne 0$ gives
a new family of orthogonal polynomials with respect to the inner product (5.2.2). In
fact, given the inner product (5.2.2), any two families of orthogonal polynomials can
differ only by these scalings. Since $\deg \phi_j = j$, the span of $\{\phi_0, \ldots, \phi_\ell\}$ is exactly the set
of polynomials of degree $\le \ell$. Thus $\phi_{\ell+1}$ is orthogonal to $\mathcal{P}_\ell$, the set of polynomials
of degree $\le \ell$, which has dimension $\ell + 1$. On the other hand, $\phi_{\ell+1}$ is in $\mathcal{P}_{\ell+1}$, which
has dimension $\ell + 2$. Thus $\phi_{\ell+1}$ has to lie on a one-dimensional subspace of $\mathcal{P}_{\ell+1}$.
The only choice determining $\phi_{\ell+1}$ is its scaling.

Properties and examples of orthogonal polynomials can be found in Section 4.7.2.
One property not in Section 4.7.2 that is relevant to us here is the number of zeros
of $\phi_j$ in $(a, b)$:


Theorem 5.2 In a family of orthogonal polynomials $\{\phi_0, \phi_1, \ldots\}$ with respect to the
inner product (5.2.2), $\phi_j$ has exactly $j$ simple roots in $(a, b)$.


Proof First we show that $\phi_j$ has at least $j$ roots in $(a, b)$ by contradiction. Suppose $\phi_j$
has $r < j$ roots in $(a, b)$. Then the number of points $x$ at which $\phi_j(x)$ changes sign is
$s \le r < j$. Call these points $z_1, z_2, \ldots, z_s \in (a, b)$. Let $\psi(x) = \prod_{i=1}^{s} (x - z_i)$, which
has degree $s \le r < j$. Because $\deg \psi = s$, we can write $\psi$ as a linear combination
of $\phi_0, \phi_1, \ldots, \phi_s$, each of which is orthogonal to $\phi_j$. Therefore, $\psi$ is orthogonal to
$\phi_j$ and so

\[ (\phi_j, \psi)_w = \int_a^b w(x)\,\phi_j(x)\,\psi(x)\,dx = 0. \]

The integrand $w(x)\,\phi_j(x)\,\psi(x)$ does not change sign between positive and negative
and can only be zero at a finite number of points. Thus
$(\phi_j, \psi)_w = \int_a^b w(x)\,\phi_j(x)\,\psi(x)\,dx \ne 0$, which is a contradiction.

Thus $\phi_j$ has at least $j$ roots in $(a, b)$. To show that $\phi_j$ has at most $j$ roots in $(a, b)$,
we recall that $\phi_j$ has degree $j$, and so the total number of roots of $\phi_j$ is $j$, counting
multiplicity. Thus the multiplicity of each root must be one, and there cannot be more
than $j$ of them. That is, $\phi_j$ has exactly $j$ simple roots in $(a, b)$.


Since scaling a polynomial does not change its roots, the inner product uniquely
specifies the roots of $\phi_j$ in any family of orthogonal polynomials.


5.2.2 Orthogonal Polynomials and Integration 


We seek an integration method

\[ \int_a^b w(x)\,f(x)\,dx \approx \sum_{i=1}^{n} w_i\,f(x_i) \]

that is exact if $f$ is a polynomial of degree $\le 2n - 1$. These methods are Gaussian
quadrature methods. If we have already chosen the points $x_i$, then the weights $w_i$
are given by

\[ w_i = \int_a^b w(x)\,L_i(x)\,dx \tag{5.2.3} \]

where $L_i$ is the $i$th Lagrange interpolation polynomial (4.1.3) for the interpolation
points $x_1, x_2, \ldots, x_n$. With $n$ interpolation points, each $L_i$ has degree $n - 1$. This
ensures that the method is exact for all polynomials of degree $\le n - 1$. If $f$ is a
polynomial of degree $\le 2n - 1$ we can use polynomial division with remainder to write

\[ f(x) = q(x)\,\phi_n(x) + r(x), \]

where $q$ and $r$ are polynomials of degree $\le n - 1$. Then

\[ \begin{aligned}
\int_a^b w(x)\,f(x)\,dx &= \int_a^b w(x)\,[q(x)\,\phi_n(x) + r(x)]\,dx \\
 &= \int_a^b w(x)\,q(x)\,\phi_n(x)\,dx + \int_a^b w(x)\,r(x)\,dx \\
 &= (q, \phi_n)_w + \int_a^b w(x)\,r(x)\,dx.
\end{aligned} \]

Now $(q, \phi_n)_w = 0$ since $q$ is a polynomial of degree $\le n - 1$, and therefore can be
written as a linear combination of $\phi_0, \phi_1, \ldots, \phi_{n-1}$, each of which is orthogonal to
$\phi_n$. On the other hand, by the construction of the $w_i$'s in (5.2.3), the method is already
exact for polynomials of degree $\le n - 1$, so it must be exact for $r(x)$. Therefore

\[ \int_a^b w(x)\,f(x)\,dx = \int_a^b w(x)\,r(x)\,dx = \sum_{i=1}^{n} w_i\,r(x_i). \]

What we want is

\[ \int_a^b w(x)\,f(x)\,dx = \sum_{i=1}^{n} w_i\,f(x_i) = \sum_{i=1}^{n} w_i\,[q(x_i)\,\phi_n(x_i) + r(x_i)]. \]

In order to guarantee this, we need $\sum_{i=1}^{n} w_i\,q(x_i)\,\phi_n(x_i) = 0$ for any polynomial $q$
of degree $\le n - 1$. Setting $q(x) = L_j(x)$ for $j = 1, 2, \ldots, n$ gives $w_j\,\phi_n(x_j) = 0$, so
with nonzero weights we see that $\phi_n(x_j) = 0$.
That is, the evaluation points should be the roots of $\phi_n$.

In summary, to create an integration method

\[ \int_a^b w(x)\,f(x)\,dx \approx \sum_{i=1}^{n} w_i\,f(x_i) \]

that is exact whenever $f$ is a polynomial of degree $\le 2n - 1$, we first compute
the roots of $\phi_n$, which is the degree $n$ polynomial from the family of orthogonal
polynomials for the inner product (5.2.2). The evaluation points $x_i$ are the roots of
$\phi_n$. We then set the weights $w_i$ according to (5.2.3).
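The recipe can be sketched for the special case $w(x) = 1$ on $[-1, +1]$, where $\phi_n$ is (up to scaling) the Legendre polynomial $P_n$. This is a minimal illustration under stated assumptions, not the algorithm recommended later in the section: the roots are found by Newton's method from a commonly used cosine starting guess, and the weights use the closed form $w_i = 2/[(1 - x_i^2)\,P_n'(x_i)^2]$, a standard identity for this weight function that we substitute for direct integration of the Lagrange polynomials in (5.2.3). All function names are our own.

```python
import math

def legendre_and_derivative(n, x):
    """Return (P_n(x), P_n'(x)) for |x| < 1 via the three-term recurrence."""
    if n == 0:
        return 1.0, 0.0
    p_prev, p = 1.0, x                      # P_0 and P_1
    for k in range(1, n):
        # (k+1) P_{k+1}(x) = (2k+1) x P_k(x) - k P_{k-1}(x)
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    dp = n * (x * p - p_prev) / (x * x - 1.0)
    return p, dp

def gauss_legendre(n):
    """Nodes and weights of the n-point Gauss rule for w(x) = 1 on [-1, 1]."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))   # starting guess
        for _ in range(100):                              # Newton iteration
            p, dp = legendre_and_derivative(n, x)
            dx = p / dp
            x -= dx
            if abs(dx) < 1e-15:
                break
        _, dp = legendre_and_derivative(n, x)
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights

if __name__ == "__main__":
    xs, ws = gauss_legendre(5)
    for k in range(12):
        approx = sum(w * x**k for x, w in zip(xs, ws))
        exact = 2.0 / (k + 1) if k % 2 == 0 else 0.0
        print(k, abs(approx - exact))
```

With $n = 5$ the printed errors are at roundoff level for monomials of degree up to $9 = 2n - 1$ and become visible at degree 10, as the counting argument predicts.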


5.2.3 Why the Weights are Positive


In contrast to the high order Newton–Cotes integration methods of Section 5.1.2,
the weights for Gaussian quadrature methods are all positive. To see why, consider
$f(x) = L_i(x)^2$ where $L_i$ is the $i$th Lagrange interpolation polynomial for
$x_1, x_2, \ldots, x_n$. Since we have just $n$ interpolation points, $\deg L_i = n - 1$, and therefore
$\deg L_i^2 = 2(n - 1) = 2n - 2 \le 2n - 1$, so the method is exact for $L_i^2$:

\[ 0 < \int_a^b w(x)\,L_i(x)^2\,dx = \sum_{j=1}^{n} w_j\,L_i(x_j)^2. \]

But $L_i(x_j) = 1$ if $i = j$ and zero if $i \ne j$, so $\sum_{j=1}^{n} w_j\,L_i(x_j)^2 = w_i$; thus $w_i > 0$.
Having positive weights is an important bonus as then we can apply Theorem 5.1
to show that the integration error is bounded by

\[ 2 \int_a^b w(x)\,dx\; \| f - q \|_\infty \]

for any polynomial $q$ of degree $\le 2n - 1$.

To estimate the errors in a more conventional way, we can use the Hermite interpolant
$p_n(x)$ where $p_n(x_i) = f(x_i)$ and $p_n'(x_i) = f'(x_i)$ for $i = 1, 2, \ldots, n$, which
is a polynomial of degree $\le 2n - 1$: since $p_n$ is exactly integrated by this method,

\[ \begin{aligned}
\int_a^b w(x)\,f(x)\,dx - \sum_{i=1}^{n} w_i\,f(x_i)
 &= \int_a^b w(x)\,(f(x) - p_n(x))\,dx - \sum_{i=1}^{n} w_i\,(f(x_i) - p_n(x_i)) \\
 &= \int_a^b w(x)\,(f(x) - p_n(x))\,dx \\
 &= \int_a^b w(x)\,\frac{f^{(2n)}(c_x)}{(2n)!} \prod_{i=1}^{n} (x - x_i)^2\,dx \\
 &= \frac{f^{(2n)}(c)}{(2n)!} \int_a^b w(x) \prod_{i=1}^{n} (x - x_i)^2\,dx
\end{aligned} \]

for some $c$ between $a$ and $b$ by the Generalized Mean Value Theorem, as
$w(x) \prod_{i=1}^{n} (x - x_i)^2$ is never negative on $(a, b)$.

In the case where $w(x) = 1$ for all $x$, the orthogonal polynomials for
$(f, g) = \int_{-1}^{+1} f(x)\,g(x)\,dx$ are the Legendre polynomials given by the Rodrigues formula
(4.7.9)

\[ P_n(x) = \frac{1}{2^n\,n!} \frac{d^n}{dx^n} (x^2 - 1)^n. \]

Because $w(x)$ is constant, this Gaussian quadrature method can be used as a composite
method with error $O(h^{2n})$ where $h$ is the common width of each piece in

\[ \int_a^b f(x)\,dx = \sum_{j=0}^{M-1} \int_{z_j}^{z_{j+1}} f(x)\,dx
   \approx \sum_{j=0}^{M-1} \sum_{i=1}^{n} \frac{h}{2}\,w_i\,f(z_j + h(1 + x_i)/2) \]

where $x_i$ is the $i$th zero of $P_n$ in $(-1, +1)$. The points $x_i$ and weights $w_i$ can be
efficiently computed via the coefficients of the three-term recurrence relation (4.7.8)
by the algorithm of [106].
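The composite formula can be tried with the two-point Gauss–Legendre rule, whose nodes $\pm 1/\sqrt{3}$ and weights $1$ on $[-1, +1]$ are well known; with $n = 2$ the predicted error is $O(h^4)$. The sketch below hard-codes that rule (the constant and function names are ours) and compares errors as the number of pieces $M$ doubles.

```python
import math

# Two-point Gauss-Legendre rule on [-1, 1]
GAUSS2_NODES = (-1.0 / math.sqrt(3.0), 1.0 / math.sqrt(3.0))
GAUSS2_WEIGHTS = (1.0, 1.0)

def composite_gauss(f, a, b, pieces, nodes=GAUSS2_NODES, weights=GAUSS2_WEIGHTS):
    """Composite Gauss rule: sum over pieces of (h/2) w_i f(z_j + h (1 + x_i)/2)."""
    h = (b - a) / pieces
    total = 0.0
    for j in range(pieces):
        z = a + j * h                       # left end of piece j
        for x, w in zip(nodes, weights):
            total += 0.5 * h * w * f(z + 0.5 * h * (1.0 + x))
    return total

if __name__ == "__main__":
    exact = math.e - 1.0                    # integral of exp(x) over [0, 1]
    previous = None
    for pieces in (2, 4, 8, 16):
        err = abs(composite_gauss(math.exp, 0.0, 1.0, pieces) - exact)
        ratio = previous / err if previous else float("nan")
        print(pieces, err, ratio)           # ratio should approach 2^4 = 16
        previous = err
```

Halving $h$ should reduce the error by a factor close to $2^{2n} = 16$ here, using only two function evaluations per piece.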


Exercises. 


(1) From the Rodrigues formula for Legendre polynomials, compute $(P_n, P_n)$ and
$(x P_n, P_{n+1}) = \int_{-1}^{+1} x\,P_n(x)\,P_{n+1}(x)\,dx$.

(2) Determine the coefficients $a_n$, $b_n$, and $c_n$ where
$P_{n+1}(x) = (a_n x + b_n)\,P_n(x) - c_n\,P_{n-1}(x)$. Re-arrange these equations to find
coefficients $\alpha_n$ and $\beta_n$ so that
$x\,P_n(x) = \gamma_{n-1}\,P_{n-1}(x) + \alpha_n\,P_n(x) + \beta_n\,P_{n+1}(x)$.

(3) Suppose that the polynomials $p_0, p_1, p_2, \ldots$ are a family of orthogonal polynomials
with respect to the inner product $(f, g)_w = \int_a^b w(x)\,f(x)\,g(x)\,dx$. Once
the roots $x_0, x_1, \ldots, x_n$ of $p_{n+1}$ have been found, the weights for Gaussian
quadrature can be computed as $w_j = \int_a^b w(x)\,L_j(x)\,dx$ where $L_j$ is the $j$th
Lagrange interpolation polynomial. Show that
$L_j(x) = p_{n+1}(x)/[p_{n+1}'(x_j)\,(x - x_j)]$.

(4) Compute the family of orthogonal polynomials (up to a scale factor for each
polynomial) of all degrees $\le 5$ for $(f, g)_w = \int_0^1 -\ln(x)\,f(x)\,g(x)\,dx$.

(5) Use the results of the previous Exercise to give a 5-point formula that can compute
$\int_0^1 -\ln x\; f(x)\,dx$ whenever $f$ is a polynomial of degree $\le 9$.

(6) Compute the family of orthogonal polynomials (up to a scale factor for each
polynomial) of all degrees $\le 5$ for $(f, g)_w = \int_0^1 x^{-1/2}\,f(x)\,g(x)\,dx$.

(7) Compute the orthogonal polynomials in Exercise 6 via the Rodrigues formula

\[ p_n(x) = c_n\,x^{1/2}\,\frac{d^n}{dx^n} \bigl[ x^{n-1/2}\,(1 - x)^n \bigr]. \]
(8) Using polar co-ordinates, develop a method to compute integrals over the unit
disk $D$: $\int_D f(x, y)\,dx\,dy = \int_0^{2\pi} \int_0^1 f(r \cos\theta, r \sin\theta)\,r\,dr\,d\theta$. Use scaled and
shifted Gauss–Legendre quadrature in $r$ for the interval $[0, 1]$, and the rectangle
rule in $\theta$ (because the integrand is periodic in $\theta$). Use $n$ integration points for
each polar co-ordinate with $n = 2^k$, $k = 1, 2, \ldots, 5$. Test this method on the
function f(x,y) =e*(_+ pee By yr les for the “exact” value, use n = 2°. 

(9) Using spherical polar co-ordinates, develop a method to compute integrals over
the unit ball $B$:

\[ \int_B f(x, y, z)\,dx\,dy\,dz = \int_0^1 \int_{-\pi/2}^{+\pi/2} \int_0^{2\pi}
   f(r \cos\theta \sin\phi,\; r \cos\theta \cos\phi,\; r \sin\theta)\; r^2 \cos\theta\; d\phi\,d\theta\,dr. \]

Use Gauss–Legendre quadrature in $r$ and $\theta$, and the rectangle rule in $\phi$. Use $n$
points in each polar co-ordinate to give an estimate for the integral:

\[ \int_B f(x, y, z)\,dx\,dy\,dz \approx \sum_{j,k,\ell=0}^{n-1} w_j^{(r)}\,w_k^{(\theta)}\,w_\ell^{(\phi)}\,
   f(r_j \cos\theta_k \sin\phi_\ell,\; r_j \cos\theta_k \cos\phi_\ell,\; r_j \sin\theta_k)\; r_j^2 \cos\theta_k. \]


5.3 Multidimensional Integration


Once one-variable integration is mastered, multivariate integration can be understood 
as repeated one-variable integration. In this way, multivariate integration should 
not be much harder than one-variable integration. But many integration regions do 
not have such nice Cartesian product structure. High-dimensional integration also 
requires different approaches to avoid exponential growth of the computational cost 
as the dimension increases. 


5.3.1 Tensor Product Methods 


Tensor product methods are based on the observation that integrals over rectangles,
and rectangular solids, can be expressed as repeated one-variable integrals. Specifically,
if $R = \{ (x, y) \mid a \le x \le b,\ c \le y \le d \}$, then

\[ \int_R f(x, y)\,dx\,dy = \int_a^b \int_c^d f(x, y)\,dy\,dx. \]

A one-variable integration rule can be used in each co-ordinate:
$\int_a^b g(x)\,dx \approx \sum_{i=1}^{m} v_i\,g(x_i)$, $\int_c^d h(y)\,dy \approx \sum_{j=1}^{n} w_j\,h(y_j)$. Combining these gives

\[ \int_a^b \int_c^d f(x, y)\,dy\,dx \approx \sum_{i=1}^{m} \sum_{j=1}^{n} v_i\,w_j\,f(x_i, y_j). \tag{5.3.1} \]

In $d$ dimensions, if $R = [a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_d, b_d] \subset \mathbb{R}^d$ is the region of
integration, then we can use a single rule $\int_a^b h(x)\,dx \approx (b - a) \sum_{i=1}^{n} w_i\,h(x_i)$ across
each variable of integration to approximate

\[ \int_R f(x)\,dx \approx (b - a)^d \sum_{i_1, i_2, \ldots, i_d = 1}^{n}
   \Bigl( \prod_{j=1}^{d} w_{i_j} \Bigr) f(x_{i_1}, x_{i_2}, \ldots, x_{i_d}) \tag{5.3.2} \]

if $R = [a, b]^d$. The number of function evaluations needed is $n^d$. While this is usually
acceptable in two or three dimensions, it becomes much less so in higher dimensions.
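A tensor product rule along the lines of (5.3.1) can be sketched as follows. The one-dimensional rule here is composite Simpson's rule, and the helper names `simpson_rule` and `tensor_product_integral` are illustrative; the weights returned already include the interval length, so no separate $(b - a)^d$ factor appears.

```python
import itertools
import math

def simpson_rule(a, b, n):
    """Nodes and weights of composite Simpson's rule on [a, b]; n must be even."""
    if n % 2 != 0:
        raise ValueError("n must be even")
    h = (b - a) / n
    nodes = [a + i * h for i in range(n + 1)]
    weights = [(h / 3.0) * (1 if i in (0, n) else 4 if i % 2 == 1 else 2)
               for i in range(n + 1)]
    return nodes, weights

def tensor_product_integral(f, rules):
    """Apply one 1-D rule per coordinate; rules is a list of (nodes, weights)."""
    total = 0.0
    per_axis = [list(zip(nodes, weights)) for nodes, weights in rules]
    for combo in itertools.product(*per_axis):
        point = [xw[0] for xw in combo]
        weight = math.prod(xw[1] for xw in combo)
        total += weight * f(*point)
    return total

if __name__ == "__main__":
    rules = [simpson_rule(0.0, 1.0, 2), simpson_rule(0.0, 2.0, 2)]
    value = tensor_product_integral(lambda x, y: x**3 * y**2 + x, rules)
    print(value, 5.0 / 3.0)    # Simpson is exact for cubics in each variable
```

The loop over `itertools.product` makes the $n^d$ cost explicit: with three nodes per axis there are 9 evaluations in two dimensions and 27 in three.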


5.3.2 Lagrange Integration Methods 


Lagrange integration methods continue the tradition of integration methods based
on interpolation. In this case, we use Lagrange interpolation on triangles, tetrahedra,
and other simplices as described in Section 4.3.1. The best known method is arguably
the one based on the vertices of a triangle $K \subset \mathbb{R}^2$:

\[ \int_K f(x)\,dx \approx \frac{1}{3} \bigl( f(v_1) + f(v_2) + f(v_3) \bigr)\,\mathrm{area}(K), \tag{5.3.3} \]

which is based on linear interpolation on the triangle. For a tetrahedron $K \subset \mathbb{R}^3$ we
have

\[ \int_K f(x)\,dx \approx \frac{1}{4} \sum_{i=1}^{4} f(v_i)\,\mathrm{vol}(K). \tag{5.3.4} \]

We can develop formulas for integration on a reference triangle, tetrahedron, or
simplex $\hat{K}$, and transfer the formula to a given element $K$ via an affine transformation,
$x = T_K(\hat{x}) = A_K \hat{x} + b_K$:

\[ \int_K p(x)\,dx = |\det A_K| \int_{\hat{K}} p(T_K(\hat{x}))\,d\hat{x}. \tag{5.3.5} \]

If an integration method is exact for polynomials of degree $\le k$ on the reference
element $\hat{K}$, then using (5.3.5) will give a method that is exact for polynomials of
degree $\le k$ on $K$.
5.3.3 Symmetries and Integration


Symmetries can be very helpful in understanding integration methods on shapes like
triangles and tetrahedra. An affine symmetry of a region $R \subset \mathbb{R}^d$ is an invertible affine
transformation $R \to R$. The set of affine symmetries $R \to R$ forms a group under
composition. The order of a group is the number of transformations in the group.

Unlike an interval $[a, b]$, which just has a reflection symmetry $x \mapsto a + b - x$
(the group of affine symmetries has order two), the group of affine symmetries on
a triangle has order $3! = 6$, and on a tetrahedron has order $4! = 24$. Squares have
groups of affine symmetries of order $2^2 \times 2!$ (involving reflections about each axis
and $(x, y) \leftrightarrow (y, x)$) while cubes have symmetries of order $2^3 \times 3!$. Circles, disks,
spheres, and balls have infinite affine symmetry groups. Symmetry can be helpful
in improving the order of accuracy for a given number of evaluation points for an
interval. Symmetry is discussed in Section 5.1.2 as the source of superconvergence
for symmetric Newton–Cotes methods of even order.

Most integration methods on regions $R$ in $\mathbb{R}^d$ are symmetric in ways that reflect
the symmetry of $R$. Of particular interest are points that are invariant under every
affine symmetry of $R$. For triangles, the centroid is invariant under every affine
symmetry of the triangle. For disks or spheres, the center is invariant under every
affine symmetry. Other useful points are often invariant under a subgroup of
the affine symmetries, such as the mid-point of an edge, which is invariant under the
subgroup that keeps the edge fixed.

Just as was noted regarding the mid-point and Simpson's rules in one variable (see
Sections 5.1.1.1 and 5.1.1.2), symmetry of the interpolation points and weights is
often related to an improved order of convergence. For example, the method (5.3.3)
is invariant under all affine symmetries. Even better, though, is the one-point rule
evaluated at the centroid:

\[ \int_K f(x)\,dx \approx f\Bigl( \frac{v_1 + v_2 + v_3}{3} \Bigr)\,\mathrm{area}(K). \tag{5.3.6} \]

Both (5.3.3) and (5.3.6) are exact for linear functions on triangles.

Integrating the Lagrange quadratic interpolant (4.3.4) gives a method that is exact
for quadratic functions. Interestingly, in this case, the integrals associated with the
Lagrange interpolating polynomials for the vertices evaluate to zero; thus only the
values at the mid-points need be evaluated, giving the rule

\[ \int_K f(x)\,dx \approx \frac{1}{3} \Bigl[ f\Bigl( \frac{v_1 + v_2}{2} \Bigr) + f\Bigl( \frac{v_2 + v_3}{2} \Bigr)
   + f\Bigl( \frac{v_3 + v_1}{2} \Bigr) \Bigr]\,\mathrm{area}(K), \tag{5.3.7} \]

which is a three-point formula that is exact for all quadratics. Again, this rule is
invariant under all affine symmetries of a triangle.
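The three triangle rules (5.3.3), (5.3.6) and (5.3.7) are short enough to compare directly. The sketch below represents vertices as coordinate pairs; the function names are our own choice.

```python
def triangle_area(v1, v2, v3):
    """Area of the triangle with vertices v1, v2, v3 given as (x, y) pairs."""
    return 0.5 * abs((v2[0] - v1[0]) * (v3[1] - v1[1])
                     - (v3[0] - v1[0]) * (v2[1] - v1[1]))

def vertex_rule(f, v1, v2, v3):
    """Rule (5.3.3): average of the vertex values times the area."""
    return triangle_area(v1, v2, v3) * (f(*v1) + f(*v2) + f(*v3)) / 3.0

def centroid_rule(f, v1, v2, v3):
    """Rule (5.3.6): value at the centroid times the area."""
    c = ((v1[0] + v2[0] + v3[0]) / 3.0, (v1[1] + v2[1] + v3[1]) / 3.0)
    return triangle_area(v1, v2, v3) * f(*c)

def edge_midpoint_rule(f, v1, v2, v3):
    """Rule (5.3.7): average of the edge mid-point values times the area."""
    mids = [((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
            for p, q in ((v1, v2), (v2, v3), (v3, v1))]
    return triangle_area(v1, v2, v3) * sum(f(*m) for m in mids) / 3.0

if __name__ == "__main__":
    K = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
    f = lambda x, y: x * x               # exact integral over K is 1/12
    for rule in (vertex_rule, centroid_rule, edge_midpoint_rule):
        print(rule.__name__, rule(f, *K))
```

On the unit triangle only the edge mid-point rule reproduces $\int_K x^2\,dx = 1/12$, while all three integrate linear functions exactly.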
Table 5.3.1 Types of groups of points by symmetries. Note that unless indicated, all co-ordinate
values are distinct ($\alpha \ne \beta \ne \gamma$ etc.). Also the sum of the barycentric coordinates must be one.

(A) For triangles

Symmetry type    Barycentric coordinates                   # Points
I                $(1/3, 1/3, 1/3)$                         1
II               $(\alpha, \alpha, \beta)$                 3
III              $(\alpha, \beta, \gamma)$                 6

(B) For tetrahedra

Symmetry type    Barycentric coordinates                   # Points
I                $(1/4, 1/4, 1/4, 1/4)$                    1
II               $(\alpha, \alpha, \alpha, \beta)$         4
III              $(\alpha, \alpha, \beta, \beta)$          6
IV               $(\alpha, \alpha, \beta, \gamma)$         12
V                $(\alpha, \beta, \gamma, \delta)$         24


5.3.4 Triangles and Tetrahedra 


High-order integration formulas for triangles, tetrahedra, and other shapes have been 
found by a number of authors (see the foundational book [244] and the survey 
papers [57, 58, 167]). More recent high-order formulas have been obtained by direct 
application of numerical methods to the conditions rather than assuming symmetry 
properties (for example, see [252] for spherically symmetric regions). However, 
for low to moderate degree, symmetry principles have been very useful for finding 
efficient rules over triangles and tetrahedra (see, for example [168] from 1975 for 
triangles). 

A question that may arise is why we should have high-order rules, of order ten or
more say, when most applications calling for integration over triangles or tetrahedra
rarely need such high-order accuracy. The answer lies in the integrands that arise. Finite
element methods for second order elliptic partial differential equations need to compute
integrals of the form $\int_K a(x)\,\nabla\phi_i(x)^T \nabla\phi_j(x)\,dx$ over a triangle or tetrahedron $K$. If we
used, for example, the Argyris element for continuous first derivatives, the basis functions
$\phi_i$ have degree five, which means that $\nabla\phi_i(x)^T \nabla\phi_j(x)$ has degree $2 \times (5 - 1) = 8$. Thus
the degree of polynomials that our integration method integrates exactly should be
at least eight. An integration method of order ten, exactly integrating polynomials
of degree $\le 10$, would then be able to exactly integrate $\int_K a(x)\,\nabla\phi_i(x)^T \nabla\phi_j(x)\,dx$
where $a(x)$ is a polynomial of degree $\le 2$. Thus the error for general $a(x)$ would
then be $O(h_K^3\,|K|)$ rather than $O(h_K\,|K|)$ where $h_K = \mathrm{diam}\,K$. Thus the order of the
integration method needs to be co-ordinated with the order of the elements chosen.

Below we show some formulas for low to moderate order on triangles and tetrahedra.
The points are represented by their barycentric coordinates. All the methods
in the tables below are symmetric methods, so it is not necessary to list all the points.
Instead, we list one point in a set, and the other points in the set are generated by
applying the relevant symmetries. In terms of barycentric coordinates, applying the
relevant symmetries simply means permuting the coordinates.

The different types of symmetric sets of points for a triangle are listed in
Table 5.3.1.

The methods have the form

\[ \int_K f(x)\,dx \approx |K| \sum_{i=1}^{m} w_i\,f(x_i); \]

again, the weights are not repeated as the weight $w_i$ is the same for each symmetric
set of points $x_i$. The methods for triangles are shown in Table 5.3.2, and the methods
for tetrahedra are shown in Table 5.3.3.


Exercises. 


(1) Show that for any region $R \subset \mathbb{R}^n$, if $\bar{x}_R = \int_R x\,dx / \int_R dx$ is the centroid of
$R$, then the one-point rule $\int_R f(x)\,dx \approx \mathrm{vol}_n(R)\,f(\bar{x}_R)$ is exact for all linear
functions $f$. Note that $\mathrm{vol}_n(R) = \int_R dx$ is the $n$-dimensional volume of $R$.

(2) Show that the symmetric rule for a triangle $K$ with vertices $v_1, v_2, v_3$ and centroid
$v_0 = \frac{1}{3}(v_1 + v_2 + v_3)$, the rule
$\int_K f(x)\,dx \approx \mathrm{area}(K)\,\bigl[ \frac{3}{4} f(v_0) + \frac{1}{12} ( f(v_1) + f(v_2) + f(v_3) ) \bigr]$,
is exact for all quadratic $f$.

(3) Let $K$ be the triangle with vertices $(0, 0)$, $(1, 0)$, and $(0, 1)$. Suppose a symmetric
integration rule $\int_K f(x)\,dx \approx \sum_{i=1}^{m} w_i\,f(v_i)$ is exact for all linear functions
and also for $f(x, y) = x^2$; show that the rule is exact for all quadratic functions.
[Hint: Consider the action of the affine symmetries $x \leftrightarrow y$ and
$(x, y) \mapsto (y, 1 - x - y) \mapsto (1 - x - y, x)$ on $f(x, y) = x^2$. Show that apart from linear terms,
this can generate all quadratic monomials $x^2$, $y^2$, and $xy$.]

(4) Following the previous Exercise, consider $K$ to be the unit tetrahedron with
vertices $0$ and $e_j$, $j = 1, 2, 3$, in $\mathbb{R}^3$. Suppose a symmetric integration rule
$\int_K f(x)\,dx \approx \sum_{i=1}^{m} w_i\,f(v_i)$ is exact for all linear functions on $K$. Suppose
also that it is exact for $f(x, y, z) = x^2$ and for $f(x, y, z) = xy$. Show that it is
therefore exact for all quadratics on $\mathbb{R}^3$. [Hint: Use the affine symmetries $x \leftrightarrow y$,
$x \leftrightarrow z$, and $y \leftrightarrow z$; then use the symmetry $(x, y, z) \mapsto (y, z, 1 - x - y - z)$.]

(5) Using the decomposition of a triangle into congruent sub-triangles below, we
estimate the integral $\int_K f(x, y)\,dx\,dy$ for $f(x, y) = e^{x-y} \cos(x + 2y)$ where
$K$ is the standard unit triangle with vertices $(0, 0)$, $(1, 0)$, and $(0, 1)$. We do this
by subdividing $K$ $m$ times in this way recursively for $m = 1, 2, \ldots, 6$, and
applying the centroid rule for each triangle in the final subdivision. Note that
after $m$ subdivisions, there are $4^m$ triangles.
Plot the error in the computed integrals against $m$, the number of subdivisions.
How does the error relate to the number of function evaluations?

Fig. 5.3.1 Tetrahedral–octahedral decompositions: (A) Tetrahedron, (B) Octahedron

(6) Unfortunately, it is not possible to subdivide a tetrahedron into 8 congruent
sub-tetrahedra. However, it is possible to have a mutually recursive decomposition
of a tetrahedron into 4 tetrahedra and an octahedron, and of an octahedron
into 6 octahedra and 8 tetrahedra. This decomposition is described in [229].
In this Exercise, we will verify that if we begin with either regular solid, then
the resulting subdivision is into regular solids with the edge lengths halved.
Applying affine transformations ensures that the resulting subdivisions of any
given tetrahedron will be into congruent tetrahedra and congruent octahedra.
Figure 5.3.1 illustrates the decompositions.


(a) The decomposition of the tetrahedron creates four sub-tetrahedra by joining 
each vertex of the original tetrahedron to the midpoints of the incident edges. 
Assuming that the original tetrahedron is regular (all edge lengths are the 
same, so each face is an equilateral triangle), show that all edge lengths of the 
sub-tetrahedra are half the edge lengths of the original tetrahedron. Show 
that the solid remaining after removing these four tetrahedra is a regular 
octahedron. 

(b) To decompose a regular octahedron, join each vertex of the original octa- 
hedron to the midpoints of the incident edges and the centroid. Show that 
this creates six regular sub-octahedra. Show that the remainder of the orig- 
inal octahedron forms eight regular tetrahedra formed by joining the edge 
midpoints of each face to the centroid of the original octahedron. 
(7) Test the methods listed in Table 5.3.2 for integration over triangles, by applying
them to the integral $\int_K e^{x-y} \cos(x + 2y)\,dx\,dy$ where $K$ is the standard unit
triangle with vertices at $(0, 0)$, $(1, 0)$, and $(0, 1)$.

(8) Test the methods listed in Table 5.3.3 for integration over tetrahedra, by applying
them to the integral $\int_K e^{x-y} \cos(x + 2y - z)\,dx\,dy\,dz$ where $K$ is the standard
unit tetrahedron with vertices at $(0, 0, 0)$, $(1, 0, 0)$, $(0, 1, 0)$, and $(0, 0, 1)$.

(9) Verify that Radon's method in Table 5.3.2 is exact for all polynomials $p(x, y)$
of degree $\le 5$. Use symmetry where possible to reduce the number of equations
to check.


5.4 High-Dimensional Integration 


If the dimension $d$ is larger than three or four, standard integration methods lose
their effectiveness. Integrating over a hypercube $[a, b]^d$ with $n$ evaluation points
on each co-ordinate gives methods needing $n^d$ evaluation points. This is strongly
exponential in the dimension. In many applications, particularly in connection with
statistical and machine learning applications, the dimension $d$ is much more than
five. The dimension $d$ can be hundreds to thousands, or even more. In these cases,
no tensor product method can be successful.

For these high-dimensional integration problems, we need a different approach.
The approach taken will often appear to be more "statistical" in flavor. The analysis of
convergence of high-dimensional integration methods should focus not so much
on the differentiability of a function as on variance and variance reduction
strategies.


5.4.1 Monte Carlo Integration 


Basic concepts, definitions, and theorems of probability theory and random variables
are discussed in Chapter 7, especially Section 7.1. Of particular importance here are
the concepts of random variables (usually denoted by capital letters), probabilities of
events $\Pr[X \in E]$, expectations of a random variable ($\mathbb{E}[X]$), variance of a random
variable ($\mathrm{Var}[X]$), and independence of a pair of random variables.

Suppose the random variable $X$ takes values in $\mathbb{R}^d$ and has a probability density
function $p(x)$, so that the probability that $X \in E$ is $\Pr[X \in E] = \int_E p(x)\,dx$. Then

\[ \int_{\mathbb{R}^d} p(x)\,f(x)\,dx = \mathbb{E}[f(X)]. \tag{5.4.1} \]

Alternatively, probability distributions can be represented by measures: $\Pr[X \in E] = \pi(E)$
for all measurable $E$. This relationship between the random variable $X$ and
the measure $\pi$ is denoted $X \sim \pi$. If $\pi$ has a density function $p(x)$, then we can also
write $X \sim p$. The expectation can be written as an integral with respect to
the probability measure $\pi$:

\[ \int_{\mathbb{R}^d} f(x)\,\pi(dx) = \mathbb{E}[f(X)]. \tag{5.4.2} \]

If we take independent samples $X_i$, $i = 1, 2, \ldots, N$, from the same probability
distribution ($X_i \sim \pi$), then the Law of Large Numbers (7.1.16) implies that with
probability one

\[ \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} f(X_i) = \mathbb{E}[f(X)]. \tag{5.4.3} \]

The question we need to ask is: "How quickly?"
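Provided $\mathbb{E}[f(X)^2]$ is finite, we can give a simple answer. For a random variable
$Y$, the variance of $Y$ is

\[ \mathrm{Var}[Y] = \mathbb{E}[(Y - \mathbb{E}[Y])^2] = \mathbb{E}[Y^2] - \mathbb{E}[Y]^2, \tag{5.4.4} \]

which is bounded above by $\mathbb{E}[Y^2]$ and below by zero. If $Y$ and $Z$ are independent
random variables, then

\[ \begin{aligned}
\Pr[Y \in E \text{ and } Z \in F] &= \Pr[Y \in E] \cdot \Pr[Z \in F], \\
\mathbb{E}[Y \cdot Z] &= \mathbb{E}[Y] \cdot \mathbb{E}[Z], \quad \text{and} \\
\mathrm{Var}[Y + Z] &= \mathrm{Var}[Y] + \mathrm{Var}[Z],
\end{aligned} \]

by Lemma 7.2 and Theorem 7.1.
Thus for the average

\[ A_n = \frac{1}{n} \sum_{i=1}^{n} f(X_i) \]

with the $X_i$ independent with identical distribution $X_i \sim \pi$ we see that

\[ \mathbb{E}[A_n] = \mathbb{E}[f(X)] = \int_{\mathbb{R}^d} f(x)\,\pi(dx) = \mu. \]

Note that if $X$ and $Y$ are independent random variables, $f(X)$ and $g(Y)$ are also
independent:

\[ \begin{aligned}
\Pr[f(X) \in E \ \&\ g(Y) \in F] &= \Pr[X \in f^{-1}(E) \ \&\ Y \in g^{-1}(F)] \\
 &= \Pr[X \in f^{-1}(E)] \cdot \Pr[Y \in g^{-1}(F)] \\
 &= \Pr[f(X) \in E] \cdot \Pr[g(Y) \in F].
\end{aligned} \]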
To see the rate of convergence, we note that

\[ \begin{aligned}
\mathrm{Var}[A_n] &= \mathrm{Var}\Bigl[ \frac{1}{n} \sum_{i=1}^{n} f(X_i) \Bigr] \\
 &= \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}[f(X_i)] \\
 &= \frac{1}{n}\,\mathrm{Var}[f(X_1)].
\end{aligned} \]

Let $\sigma = \mathrm{Var}[f(X_1)]^{1/2}$. Then $\mathrm{Var}[A_n] = \sigma^2/n$. Chebyshev's inequality (7.1.14)
then implies that for any $k > 0$,

\[ \Pr\bigl[ |A_n - \mu| \ge k\sigma/\sqrt{n} \bigr] \le 1/k^2. \]

Thus with high probability, $|A_n - \mu| = O(n^{-1/2})$. The hidden constant in this $O$
expression is $\mathrm{Var}[f(X)]^{1/2}$. Obtaining more accurate estimates involves finding
ways to reframe the problem but with a smaller value of $\mathrm{Var}[f(X_1)]$.
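A minimal Monte Carlo sketch follows, assuming $\pi$ is the uniform distribution on $[0, 1]^d$ so that $\mathbb{E}[f(X)]$ is the integral of $f$ over the unit hypercube. The function name `monte_carlo` and the test integrand $f(x) = \sum_i x_i^2$ (with exact integral $d/3$) are our choices for illustration. The returned standard error is the sample estimate of $\sigma/\sqrt{n}$.

```python
import math
import random

def monte_carlo(f, d, n, rng):
    """Estimate the integral of f over [0, 1]^d from n uniform samples.

    Returns (estimate, standard_error) where standard_error estimates sigma/sqrt(n).
    """
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        x = [rng.random() for _ in range(d)]
        y = f(x)
        total += y
        total_sq += y * y
    mean = total / n
    # Unbiased sample variance of f(X), guarded against tiny negative roundoff
    var = max(total_sq / n - mean * mean, 0.0) * n / (n - 1)
    return mean, math.sqrt(var / n)

if __name__ == "__main__":
    rng = random.Random(2024)               # fixed seed for repeatability
    f = lambda x: sum(t * t for t in x)
    for n in (1000, 4000, 16000, 64000):
        estimate, stderr = monte_carlo(f, 10, n, rng)
        print(n, estimate, abs(estimate - 10.0 / 3.0), stderr)
```

Quadrupling $n$ roughly halves the standard error, independent of the dimension $d$; a tensor product rule with even two points per axis would already need $2^{10}$ evaluations here.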

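We can estimate $\mathrm{Var}[f(X)]$ in terms of integrals via

\[ \begin{aligned}
\mathrm{Var}[f(X)] &= \mathbb{E}[f(X)^2] - \mathbb{E}[f(X)]^2 \\
 &= \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} \frac{1}{2}\,(f(x) - f(y))^2\,\pi(dx)\,\pi(dy) \\
 &= \frac{1}{2}\,\mathbb{E}\bigl[ (f(X) - f(Y))^2 \bigr]
\end{aligned} \tag{5.4.5} \]

where $X$ and $Y$ are independent random variables distributed identically: $X, Y \sim \pi$.
A quick and clear conclusion from this is that $\mathrm{Var}[f(X)] \le \|f\|_\infty^2$. Some improvements
on this basic bound are easy to find: for any constant $c$,
$\mathrm{Var}[f(X)] = \mathrm{Var}[f(X) - c] \le \|f - c\|_\infty^2$. Choosing $c = \frac{1}{2}(\min_x f(x) + \max_x f(x))$ we get
$\mathrm{Var}[f(X)] \le \frac{1}{4}\,(\max_x f(x) - \min_x f(x))^2$.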


5.4.1.1 Variance Reduction: Importance Sampling and Rare Events 


Suppose that $X \sim p$ for a probability density function $p(x)$. Then

$$\mathbb{E}_{X\sim p}[f(X)] = \int p(x)\,f(x)\,dx.$$

We can sample $X$ differently, according to another density function $\tilde p$:

$$\mathbb{E}_{X\sim\tilde p}\Big[\frac{p(X)}{\tilde p(X)}\,f(X)\Big] = \int_{\mathbb{R}^d} \tilde p(x)\,\frac{p(x)}{\tilde p(x)}\,f(x)\,dx = \int_{\mathbb{R}^d} p(x)\,f(x)\,dx.$$

By choosing $\tilde p$ appropriately, we can reduce the variance. Ideally, we would have $(p(x)/\tilde p(x))\,f(x)$ constant to minimize the variance of $(p(X)/\tilde p(X))\,f(X)$. To apply this technique we need to know what the ratio $p(x)/\tilde p(x)$ is.

Fig. 5.4.1 Rare event example of Monte Carlo integration

Consider estimating the probability of a rare event $X \ge a$, where $X$ is a random variable with probability density function $p(x)$ whose mode is far below $a$, as illustrated in Figure 5.4.1. While the example is written as a one-dimensional integral, the random variable $X$ could be the result of a large computation involving many basic random variables $X_1, X_2, \dots, X_d$; treating the problem in its original form could be a very high dimensional integration problem. Here, we consider the problem as a one-dimensional integration problem to better illustrate certain aspects of the problem.

The probability density function of the random variable $X$ is $p(x)$, so $\bar p := \Pr[X \ge a] = \mathbb{E}[\chi_{[a,\infty)}(X)] = \int_a^\infty p(x)\,dx$. Note that $\chi_E(x) = 1$ if $x \in E$ and zero if $x \notin E$. The variance of $\chi_{[a,\infty)}(X)$ for a single sample is $\operatorname{Var}[\chi_{[a,\infty)}(X)] = \bar p(1 - \bar p)$, so the standard deviation is $\sqrt{\bar p(1 - \bar p)} \gg \bar p$ if $\bar p$ is small. Thus applying the Monte Carlo integration method will require many samples. If we instead sample $X \sim \tilde p$ we have $\bar p = \mathbb{E}_{X\sim\tilde p}[(p(X)/\tilde p(X))\,\chi_{[a,\infty)}(X)]$. The variance of this method is given by

$$\operatorname{Var}_{X\sim\tilde p}\Big[\frac{p(X)}{\tilde p(X)}\,\chi_{[a,\infty)}(X)\Big] = \mathbb{E}_{X\sim\tilde p}\Big[\frac{p(X)^2}{\tilde p(X)^2}\,\chi_{[a,\infty)}(X)^2\Big] - \bar p^{\,2}.$$

We do not want $\mathbb{E}_{X\sim\tilde p}[(p(X)^2/\tilde p(X)^2)\,\chi_{[a,\infty)}(X)^2]$ to be a large multiple of $\bar p^{\,2}$. Noting that

$$\mathbb{E}_{X\sim\tilde p}\Big[\frac{p(X)^2}{\tilde p(X)^2}\,\chi_{[a,\infty)}(X)\Big] = \mathbb{E}_{X\sim\tilde p}\Big[\Big(\frac{p(X)}{\tilde p(X)}\Big)^2 \,\Big|\, X \ge a\Big]\,\Pr_{X\sim\tilde p}[X \ge a]$$

we see that we want $p(x)/\tilde p(x)$ to be small for $x \ge a$. So we should aim to have $\tilde p$ more-or-less as shown in Figure 5.4.1: $\Pr_{X\sim\tilde p}[X \ge a] \approx \tfrac{1}{2}$, but $p(x)/\tilde p(x)$ is small (perhaps $O(\bar p)$). Using $\hat p(x)$ of Figure 5.4.1 is probably counter-productive: $p(x)/\hat p(x)$ for $x$ near $a$ does not appear to be small, and would result in large variance. Even worse, the probability of getting $X \approx a$ for $X \sim \hat p$ would be small, meaning that a simulation might give apparently accurate results until the rare event (in the simulation) of getting $X \approx a$ occurs.

Fig. 5.4.2 Rare event

As an example, consider the “rare events” problem of estimating $\Pr[X \ge a]$ where $X \sim \text{Normal}(0, 1)$. The standard Monte Carlo approach is to take independent samples from the Normal(0, 1) distribution and compute the fraction of samples for which $X \ge a$. However, the variance of $\chi_{[a,\infty)}(X)$ is large compared to $\Pr[X \ge a]^2$, meaning that the number of samples is likely to be large before reasonable estimates can be obtained. The importance sampling approach is to estimate $\mathbb{E}_{X\sim\tilde p}[(p(X)/\tilde p(X))\,\chi_{[a,\infty)}(X)]$. Since we know the probability density functions for these probability distributions, we can directly compute $p(x)/\tilde p(x) = \exp(-ax + a^2/2)$, which is the ratio obtained when $\tilde p$ is the Normal($a$, 1) density. Figure 5.4.2 shows results for this where $a = 4$ so that $\Pr[X \ge a] \approx 3.167 \times 10^{-5}$. The error estimates are averaged over 10 trials. The figure shows that the estimate using the standard Monte Carlo method has 100% error until $N > 10^4$. In fact, the probability estimate using standard Monte Carlo for these trials is zero for
N <3 x 10°. Importance sampling, when done right, can give reasonably accurate 
results with a much smaller number of samples. 
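The comparison in this example can be sketched in a few lines. This is an illustrative sketch rather than the code used for Figure 5.4.2: the function name `rare_event_estimates` and the seed are assumptions, and the sampling density is taken to be Normal($a$, 1), which is the density consistent with the ratio $\exp(-ax + a^2/2)$.

```python
import math
import random

def rare_event_estimates(a, n, rng):
    """Estimate Pr[X >= a] for X ~ Normal(0,1) in two ways, using n samples each."""
    # Standard Monte Carlo: fraction of Normal(0,1) samples with X >= a.
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) >= a)
    plain = hits / n
    # Importance sampling: draw X ~ Normal(a,1) and weight each sample in the
    # event by the density ratio p(x)/p~(x) = exp(-a*x + a^2/2).
    total = 0.0
    for _ in range(n):
        x = rng.gauss(a, 1.0)
        if x >= a:
            total += math.exp(-a * x + 0.5 * a * a)
    return plain, total / n

a = 4.0
exact = 0.5 * math.erfc(a / math.sqrt(2.0))  # Pr[X >= a] for Normal(0,1)
rng = random.Random(2024)  # arbitrary seed
for n in (100, 1_000, 10_000):
    plain, weighted = rare_event_estimates(a, n, rng)
    print(n, plain, weighted, exact)
```

For these sample sizes the standard estimate is typically exactly zero, while the weighted estimate is usually already within a few percent of the exact value.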


5.4.1.2 Variance Reduction: Antithetic Variates 


Another approach to reducing the variance of the estimate is to use antithetic variates. This applies where $p(x) = p(-x)$ for all $x$; we can then take samples $X_i$ from the probability distribution and estimate

$$\mathbb{E}[f(X)] \approx \frac{1}{2N}\sum_{i=1}^{N}\big[f(X_i) + f(-X_i)\big] \tag{5.4.6}$$

with $X_i \sim p$ and generated independently. Errors in Monte Carlo type computational results for integrating $\int_0^1 dx/(1 + x^2) = \pi/4$ are shown in Figure 5.4.3 where $X \sim \text{Uniform}(-\tfrac{1}{2}, +\tfrac{1}{2})$ and $f(x) = 1/(1 + x^2)$ is evaluated at $\tfrac{1}{2} + X$ and $\tfrac{1}{2} - X$.

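First note that if $f$ is an affine function, then $\tfrac{1}{2}(f(x) + f(-x)) = f(0) = \int_{\mathbb{R}^d} p(z)\,f(z)\,dz$ for any $x$. Beyond this, we should look at the variance of the estimates

$$A_N = \frac{1}{2N}\sum_{i=1}^{N}\big[f(X_i) + f(-X_i)\big].$$

Since the $X_i$’s are independent, it suffices to estimate $\operatorname{Var}[\tfrac{1}{2}(f(X) + f(-X))]$ where $X \sim p$. If $Q_X = \tfrac{1}{2}(\delta(\cdot - X) + \delta(\cdot + X))$ then using the usual inner product

$$\tfrac{1}{2}\big(f(X) + f(-X)\big) = (Q_X, f) = \int_{\mathbb{R}^d} Q_X(x)\,f(x)\,dx,$$

and we want to estimate

$$\operatorname{Var}[(Q_X, f)] = \mathbb{E}[(Q_X, f)^2] - \mathbb{E}[(Q_X, f)]^2.$$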


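Using Fourier transforms, $(g, f) = (2\pi)^{-d}(\mathcal{F}g, \mathcal{F}f)$. We first note that $\mathbb{E}[(Q_X, f)] = (\mathbb{E}[Q_X], f)$, but

$$\mathbb{E}[Q_X](z) = \int_{\mathbb{R}^d} p(x)\,\tfrac{1}{2}\big[\delta(z - x) + \delta(z + x)\big]\,dx = \tfrac{1}{2}\big[p(z) + p(-z)\big] = p(z).$$

Thus

$$\mathbb{E}[(Q_X, f)] = (2\pi)^{-d}(\mathcal{F}p, \mathcal{F}f).$$

Let $\phi(\xi) = \mathcal{F}p(\xi) = \mathbb{E}[e^{-i\xi^T X}]$ for $X \sim p$. Note that

$$|\phi(\xi)| = \Big|\int_{\mathbb{R}^d} e^{-i\xi^T x}\,p(x)\,dx\Big| \le \int_{\mathbb{R}^d} \big|e^{-i\xi^T x}\big|\,p(x)\,dx = 1 \quad\text{for all } \xi. \tag{5.4.7}$$

Then, using the fact that $(Q_X, f)$ is real,

$$\begin{aligned}
\mathbb{E}[(Q_X, f)]^2 &= (2\pi)^{-2d}\int_{\mathbb{R}^d} \phi(\xi)\,\overline{\mathcal{F}f(\xi)}\,d\xi \int_{\mathbb{R}^d} \overline{\phi(\eta)}\,\mathcal{F}f(\eta)\,d\eta\\
&= (2\pi)^{-2d}\int_{\mathbb{R}^d}\int_{\mathbb{R}^d} \overline{\mathcal{F}f(\xi)}\,\mathcal{F}f(\eta)\,\phi(\xi)\,\overline{\phi(\eta)}\,d\xi\,d\eta.
\end{aligned}$$

Since $p(x)$ is real and $p(x) = p(-x)$, $\phi(\xi) = \mathcal{F}p(\xi) = \overline{\mathcal{F}p(-\xi)} = \overline{\mathcal{F}p(\xi)} = \overline{\phi(\xi)}$. So

$$\mathbb{E}[(Q_X, f)]^2 = (2\pi)^{-2d}\int_{\mathbb{R}^d}\int_{\mathbb{R}^d} \overline{\mathcal{F}f(\xi)}\,\mathcal{F}f(\eta)\,\phi(\xi)\,\phi(\eta)\,d\xi\,d\eta.$$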


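On the other hand,

$$\begin{aligned}
\mathbb{E}[(Q_X, f)^2] &= \int_{\mathbb{R}^d} p(x)\,(Q_x, f)^2\,dx\\
&= (2\pi)^{-2d}\int_{\mathbb{R}^d} p(x)\,\big|(\mathcal{F}Q_x, \mathcal{F}f)\big|^2\,dx\\
&= (2\pi)^{-2d}\int_{\mathbb{R}^d} p(x)\int_{\mathbb{R}^d} \mathcal{F}Q_x(\xi)\,\overline{\mathcal{F}f(\xi)}\,d\xi \int_{\mathbb{R}^d} \overline{\mathcal{F}Q_x(\eta)}\,\mathcal{F}f(\eta)\,d\eta\,dx\\
&= (2\pi)^{-2d}\int_{\mathbb{R}^d}\int_{\mathbb{R}^d} \overline{\mathcal{F}f(\xi)}\,\mathcal{F}f(\eta)\,d\xi\,d\eta \int_{\mathbb{R}^d} p(x)\,\mathcal{F}Q_x(\xi)\,\overline{\mathcal{F}Q_x(\eta)}\,dx.
\end{aligned}$$

Since $\mathcal{F}Q_x(\xi) = \tfrac{1}{2}(e^{-i\xi^T x} + e^{+i\xi^T x})$ is real, the inner integral is

$$\begin{aligned}
\int_{\mathbb{R}^d} p(x)\,\mathcal{F}Q_x(\xi)\,\mathcal{F}Q_x(\eta)\,dx
&= \int_{\mathbb{R}^d} p(x)\,\tfrac{1}{2}\big(e^{-i\xi^T x} + e^{+i\xi^T x}\big)\,\tfrac{1}{2}\big(e^{-i\eta^T x} + e^{+i\eta^T x}\big)\,dx\\
&= \tfrac{1}{4}\int_{\mathbb{R}^d} p(x)\big(e^{-i(\xi+\eta)^T x} + e^{-i(\xi-\eta)^T x} + e^{-i(-\xi+\eta)^T x} + e^{-i(-\xi-\eta)^T x}\big)\,dx\\
&= \tfrac{1}{4}\big(\phi(\xi+\eta) + \phi(\xi-\eta) + \phi(-\xi+\eta) + \phi(-\xi-\eta)\big)\\
&= \tfrac{1}{2}\big(\phi(\xi+\eta) + \phi(\xi-\eta)\big), \quad\text{using } \phi(\zeta) = \phi(-\zeta).
\end{aligned}$$

Combining the integrals gives

$$\begin{aligned}
\operatorname{Var}[(Q_X, f)] &= (2\pi)^{-2d}\int_{\mathbb{R}^d}\int_{\mathbb{R}^d} \overline{\mathcal{F}f(\xi)}\,\mathcal{F}f(\eta)\,\Big[\tfrac{1}{2}\big(\phi(\xi+\eta) + \phi(\xi-\eta)\big) - \phi(\xi)\phi(\eta)\Big]\,d\xi\,d\eta\\
&= (2\pi)^{-2d}\int_{\mathbb{R}^d}\int_{\mathbb{R}^d} \overline{\mathcal{F}f(\xi)}\,\mathcal{F}f(\eta)\,\psi_{AV}(\xi, \eta)\,d\xi\,d\eta, \quad\text{where}\\
\psi_{AV}(\xi, \eta) &= \tfrac{1}{2}\big(\phi(\xi+\eta) + \phi(\xi-\eta)\big) - \phi(\xi)\phi(\eta).
\end{aligned}$$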


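Since $\phi(0) = 1$ we can see that $\psi_{AV}(0, 0) = 0$. Also, since $p(x) \ge 0$ for all $x$ and $\int_{\mathbb{R}^d} p(x)\,dx = 1$, $\phi(\xi) = \mathcal{F}p(\xi)$ is real analytic in $\xi$, and therefore, so is $\psi_{AV}(\xi, \eta)$ in $(\xi, \eta)$. In addition, $\phi(\xi) \to 0$ as $\|\xi\|_2 \to \infty$ and $\phi(\xi) = \phi(-\xi)$ for all $\xi$, so the Taylor series expansion of $\phi$ about $\xi = 0$ can only contain even powers of $\xi$.
We can bound

$$\operatorname{Var}[(Q_X, f)] \le (2\pi)^{-2d}\int_{\mathbb{R}^d}\int_{\mathbb{R}^d} |\mathcal{F}f(\xi)|\,|\mathcal{F}f(\eta)|\,|\psi_{AV}(\xi, \eta)|\,d\xi\,d\eta.$$

Unfortunately, $\sup_{\xi,\eta} \psi_{AV}(\xi, \eta) = \tfrac{1}{2}$ (take $\xi = \eta$ and $\|\xi\|_2 \to \infty$).

By comparison, the standard Monte Carlo method would use $Q_X^{MC} = \delta(\cdot - X)$ and we get

$$\begin{aligned}
\operatorname{Var}[(Q_X^{MC}, f)] &= (2\pi)^{-2d}\int_{\mathbb{R}^d}\int_{\mathbb{R}^d} \overline{\mathcal{F}f(\xi)}\,\mathcal{F}f(\eta)\,\big[\phi(\xi - \eta) - \phi(\xi)\phi(\eta)\big]\,d\xi\,d\eta\\
&= (2\pi)^{-2d}\int_{\mathbb{R}^d}\int_{\mathbb{R}^d} \overline{\mathcal{F}f(\xi)}\,\mathcal{F}f(\eta)\,\psi_{MC}(\xi, \eta)\,d\xi\,d\eta, \quad\text{where}\\
\psi_{MC}(\xi, \eta) &= \phi(\xi - \eta) - \phi(\xi)\phi(\eta).
\end{aligned}$$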


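For $\zeta \approx 0$, $\phi(\zeta) = \phi(-\zeta) = 1 - \zeta^T B\zeta + O(\|\zeta\|_2^4)$ for a symmetric, positive definite matrix $B$ using (5.4.7) and $\phi(\zeta) = \phi(-\zeta) = \overline{\phi(\zeta)}$. Then

$$\begin{aligned}
\psi_{MC}(\xi, \eta) &= \phi(\xi - \eta) - \phi(\xi)\phi(\eta)\\
&= 1 - (\xi - \eta)^T B(\xi - \eta) - (1 - \xi^T B\xi)(1 - \eta^T B\eta) + O(\|\xi\|_2^4 + \|\eta\|_2^4)\\
&= 2\,\xi^T B\eta + O(\|\xi\|_2^4 + \|\eta\|_2^4), \quad\text{while}\\
\psi_{AV}(\xi, \eta) &= \tfrac{1}{2}\big[\phi(\xi + \eta) + \phi(\xi - \eta)\big] - \phi(\xi)\phi(\eta)\\
&= \tfrac{1}{2}\big[2 - (\xi - \eta)^T B(\xi - \eta) - (\xi + \eta)^T B(\xi + \eta)\big]\\
&\qquad - (1 - \xi^T B\xi)(1 - \eta^T B\eta) + O(\|\xi\|_2^4 + \|\eta\|_2^4)\\
&= O(\|\xi\|_2^4 + \|\eta\|_2^4).
\end{aligned}$$

This means that for $\xi, \eta \approx 0$, $|\psi_{AV}(\xi, \eta)|$ is generally much smaller than $|\psi_{MC}(\xi, \eta)|$. Thus with smooth functions $f$, $\mathcal{F}f(\xi) \to 0$ rapidly as $\|\xi\|_2 \to \infty$, and antithetic variates should give much lower variances.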


Example 5.3 Figure 5.4.3 shows the average of the absolute value of the difference between the integral estimate and the exact value of $\int_0^1 dx/(1 + x^2) = \pi/4$. The average was taken over 100 trials, using Matlab’s Mersenne Twister random number generator with seed 32451098. The value of $N$ is the number of function evaluations used. The control variate method, described in the next section, is used with $\varphi(x) = 1 - x/2$. The slopes of the lines are estimated as between $-0.45$ and $-0.47$; theoretically, they should be $-1/2$ indicating an error of $O(N^{-1/2})$. The small discrepancy is likely due to the pseudo-random variation in the precise values computed.

Fig. 5.4.3 Errors in standard


5.4.1.3 Variance Reduction: Control Variates


Another way of reducing the variance is to subtract a function $\varphi$ from $f$ where $\mathbb{E}[\varphi(X)] = \int_{\mathbb{R}^d} \varphi(x)\,p(x)\,dx$ is known. Then $\mathbb{E}[f(X)] = \mathbb{E}[f(X) - \varphi(X)] + \mathbb{E}[\varphi(X)]$ and it is $\mathbb{E}[f(X) - \varphi(X)]$ that is estimated using the Monte Carlo method. If $\varphi$ is chosen to also be a good approximation to $f$ over $\{x \mid p(x) > 0\}$, then the variance can be greatly reduced, as can be seen from (5.4.5): $\operatorname{Var}[f(X) - \varphi(X)] \le \|f - \varphi\|_\infty^2$. Provided $f$ is smooth, we can use an appropriate interpolation or approximation scheme, such as radial basis functions (see Section 4.4), which can use scattered interpolation points.
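The setting of Example 5.3 can be imitated with a short script. This is a sketch under stated assumptions, not the Matlab code behind Figure 5.4.3: the function names and seed are hypothetical, and Python's generator will not reproduce the figure's exact values. It compares standard Monte Carlo, antithetic variates, and the control variate $\varphi(x) = 1 - x/2$ (for which $\int_0^1 \varphi(x)\,dx = 3/4$), giving each method the same number of evaluations of $f$.

```python
import math
import random

def f(x):
    return 1.0 / (1.0 + x * x)

def phi(x):
    # Control variate from Example 5.3; its integral over [0, 1] is 3/4.
    return 1.0 - x / 2.0

def estimates(n_evals, rng):
    """Three estimates of the integral of f over [0, 1], each using n_evals evaluations of f."""
    # Standard Monte Carlo with U ~ Uniform(0, 1).
    plain = sum(f(rng.random()) for _ in range(n_evals)) / n_evals
    # Antithetic variates: X ~ Uniform(-1/2, 1/2), f evaluated at 1/2 + X and 1/2 - X.
    pairs = n_evals // 2
    anti = 0.0
    for _ in range(pairs):
        x = rng.random() - 0.5
        anti += 0.5 * (f(0.5 + x) + f(0.5 - x))
    anti /= pairs
    # Control variate: estimate E[f - phi], then add the known E[phi] = 3/4.
    ctrl = 0.0
    for _ in range(n_evals):
        u = rng.random()
        ctrl += f(u) - phi(u)
    ctrl = ctrl / n_evals + 0.75
    return plain, anti, ctrl

def mean_abs_errors(n_evals, trials, rng):
    """Average absolute error of each method over a number of independent trials."""
    exact = math.pi / 4.0
    sums = [0.0, 0.0, 0.0]
    for _ in range(trials):
        for i, est in enumerate(estimates(n_evals, rng)):
            sums[i] += abs(est - exact)
    return [s / trials for s in sums]

rng = random.Random(1)  # arbitrary seed
for n in (100, 1_000, 10_000):
    print(n, mean_abs_errors(n, 100, rng))
```

All three columns should decay at about the same $N^{-1/2}$ rate; the variance reduction shows up as a smaller constant, not a better exponent.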


5.4.2 Quasi-Monte Carlo Methods


Quasi-Monte Carlo methods [47, Secs. 5 & 6] aim to achieve integration errors of size $O((\ln n)^d/n)$ rather than $O(1/\sqrt{n})$ as the number of function evaluations $n \to \infty$. Quasi-Monte Carlo methods replace random (and pseudo-random) number sequences with deterministic sequences that have better uniformity properties than random (or pseudo-random) number generators. Two of the better known quasi-random sequences are the Halton and Sobol’ sequences. These sequences were first discovered by number theorists but have found application in high-dimensional numerical integration.

Sobol’ and Halton sets are illustrated in two dimensions by Figure 5.4.4, which shows 1000 points from two-dimensional Sobol’, Halton, and pseudo-random sequences (the last generated using Matlab’s Mersenne Twister generator).

Fig. 5.4.4 Sobol’, Halton, and pseudo-random points in two dimensions: (A) Sobol’ sequence, (B) Halton sequence, (C) pseudo-random sequence

It is evident from Figure 5.4.4 that the quasi-random points are generally “more regularly spaced” than the pseudo-random points. This is the essential nature of quasi-random sequences.


The essential measure of the “quality” of a quasi-random sequence is discrepancy. Typically this is measured by how well the points estimate the volume of rectangular regions. Let $\operatorname{rect}(a, b) = \{x \mid a_i \le x_i \le b_i \text{ for all } i = 1, 2, \dots, d\}$ be the rectangular solid with opposite vertices $a$ and $b \in \mathbb{R}^d$. Let $I^d = \operatorname{rect}(0, e)$, where $e = [1, 1, \dots, 1]^T \in \mathbb{R}^d$, be the unit cube in $\mathbb{R}^d$. If we have a sequence $x_1, x_2, x_3, \dots \in I^d$ then we can estimate the $d$-dimensional volume of a region $S$, $\operatorname{vol}_d(S)$, by counting the number of the points $x_i \in S$:

$$\operatorname{vol}_d(S) \approx \frac{|\{i \mid 1 \le i \le N \ \&\ x_i \in S\}|}{N} \quad\text{for large } N;$$

$$R_N(S) = \Big|\frac{|\{i \mid 1 \le i \le N \ \&\ x_i \in S\}|}{N} - \operatorname{vol}_d(S)\Big|.$$

The discrepancy for the sequence $x_1, x_2, x_3, \dots \in I^d$ measures the error in this estimate over all rectangles $\operatorname{rect}(0, b)$ with $b \in I^d$, either in the worst case or in the root-mean-square sense:

$$D_N^* = \sup_{b \in I^d} R_N(\operatorname{rect}(0, b)), \tag{5.4.8}$$

$$T_N^* = \Big[\int_{I^d} R_N(\operatorname{rect}(0, b))^2\,db\Big]^{1/2}. \tag{5.4.9}$$


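We can estimate the integral $\int_{I^d} f(x)\,dx \approx (1/N)\sum_{i=1}^{N} f(x_i)$. The error in this estimate can be bounded by the Koksma–Hlawka inequality:

Theorem 5.4 If $f\colon I^d \to \mathbb{R}$ is smooth, then the error in the integral

$$e_N[f] = \Big|\int_{I^d} f(x)\,dx - \frac{1}{N}\sum_{i=1}^{N} f(x_i)\Big|$$

is bounded by

$$e_N[f] \le D_N^*\,V_d[f] \quad\text{where} \tag{5.4.10}$$

$$V_d[f] = \int_{I^d} \Big|\frac{\partial^d f}{\partial x_1\,\partial x_2 \cdots \partial x_d}(x)\Big|\,dx + \sum_{j=1}^{d} V_{d-1}\big[f|_{x_j = 1}\big]. \tag{5.4.11}$$

Note that $V_0[f] = 0$.
The quantity (5.4.11) is called the Hardy–Krause variation.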
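Proof Note that

$$\int_{I^d} f(x)\,dx - \frac{1}{N}\sum_{i=1}^{N} f(x_i) = \int_{I^d} \Big[1 - \frac{1}{N}\sum_{i=1}^{N} \delta(x - x_i)\Big] f(x)\,dx$$

where $\delta(z)$ is the Dirac $\delta$-function (actually a distribution; see Appendix §A.2.1). Note that for $g\colon [0, 1] \to \mathbb{R}$ and $0 < u < 1$,

$$\int_0^1 \delta(x - u)\,g(x)\,dx = g(u) = g(1) - \int_u^1 g'(x)\,dx = g(1) - \int_0^1 H(x - u)\,g'(x)\,dx$$

where $H\colon \mathbb{R} \to \mathbb{R}$ is the Heaviside function: $H(z) = 1$ if $z \ge 0$ and $H(z) = 0$ if $z < 0$. For $g\colon [0, 1]^2 \to \mathbb{R}$,

$$\begin{aligned}
\int_0^1\!\!\int_0^1 \delta(x_1 - u_1)\,\delta(x_2 - u_2)\,g(x_1, x_2)\,dx_1\,dx_2
&= \int_0^1 \delta(x_1 - u_1)\int_0^1 \delta(x_2 - u_2)\,g(x_1, x_2)\,dx_2\,dx_1\\
&= \int_0^1 \delta(x_2 - u_2)\,g(1, x_2)\,dx_2 - \int_0^1 H(x_1 - u_1)\int_0^1 \delta(x_2 - u_2)\,\frac{\partial g}{\partial x_1}(x_1, x_2)\,dx_2\,dx_1.
\end{aligned}$$

Writing

$$\begin{aligned}
\int_0^1 \delta(x_2 - u_2)\,g(1, x_2)\,dx_2 &= g(1, 1) - \int_0^1 H(x_2 - u_2)\,\frac{\partial g}{\partial x_2}(1, x_2)\,dx_2 \quad\text{and}\\
\int_0^1 \delta(x_2 - u_2)\,\frac{\partial g}{\partial x_1}(x_1, x_2)\,dx_2 &= \frac{\partial g}{\partial x_1}(x_1, 1) - \int_0^1 H(x_2 - u_2)\,\frac{\partial^2 g}{\partial x_2\,\partial x_1}(x_1, x_2)\,dx_2,
\end{aligned}$$

we get

$$\begin{aligned}
\int_0^1\!\!\int_0^1 \delta(x_1 - u_1)\,\delta(x_2 - u_2)\,g(x_1, x_2)\,dx_1\,dx_2
&= g(1, 1) - \int_0^1 H(x_2 - u_2)\,\frac{\partial g}{\partial x_2}(1, x_2)\,dx_2\\
&\quad - \int_0^1 H(x_1 - u_1)\,\frac{\partial g}{\partial x_1}(x_1, 1)\,dx_1\\
&\quad + \int_0^1\!\!\int_0^1 H(x_1 - u_1)\,H(x_2 - u_2)\,\frac{\partial^2 g}{\partial x_2\,\partial x_1}(x_1, x_2)\,dx_2\,dx_1.
\end{aligned}$$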

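Generalizing to $d$ dimensions, for $z \in \mathbb{R}^d$ we define $H(z) = \prod_{j=1}^{d} H(z_j)$. Also, for $B \subseteq \{1, 2, \dots, d\}$ with $B = \{i_1, i_2, \dots, i_k\}$,

$$\frac{\partial^{|B|} g}{\partial x^B} = \frac{\partial^k g}{\partial x_{i_1}\,\partial x_{i_2} \cdots \partial x_{i_k}} \quad\text{and}\quad F_B = \{x \in [0, 1]^d \mid x_j = 1 \text{ for } j \notin B\}.$$

Then

$$\int_{[0,1]^d} \delta(x - u)\,g(x)\,dx = \sum_{B \subseteq \{1, 2, \dots, d\}} (-1)^{|B|} \int_{F_B} H(x - u)\,\frac{\partial^{|B|} g}{\partial x^B}(x)\,dx. \tag{5.4.12}$$

Note that if $B = \emptyset$, we have $\int_{F_B} H(x - u)\,\frac{\partial^{|B|} g}{\partial x^B}(x)\,dx = g(1, 1, \dots, 1)$. Also, if $B = \{1, 2, \dots, d\}$,

$$\int_{F_B} H(x - u)\,\frac{\partial^{|B|} g}{\partial x^B}(x)\,dx = \int_{[0,1]^d} H(x - u)\,\frac{\partial^d g}{\partial x_1\,\partial x_2 \cdots \partial x_d}(x)\,dx.$$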

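On the other hand,

$$\begin{aligned}
\int_{[0,1]^d} g(x)\,dx &= \int_{[0,1]^d}\int_{[0,1]^d} \delta(x - u)\,g(x)\,dx\,du\\
&= \sum_{B \subseteq \{1, 2, \dots, d\}} (-1)^{|B|} \int_{F_B} \Big(\int_{[0,1]^d} H(x - u)\,du\Big)\,\frac{\partial^{|B|} g}{\partial x^B}(x)\,dx.
\end{aligned}$$

Here $\int_{[0,1]^d} H(x - u)\,du = \operatorname{vol}_d(\{u \mid 0 \le u \le x\})$ where “$a \le b$” means “$a_i \le b_i$ for $i = 1, 2, \dots, d$”. Since $\{u \mid 0 \le u \le x\} = \operatorname{rect}(0, x)$, this gives

$$\int_{[0,1]^d} g(x)\,dx = \sum_{B \subseteq \{1, 2, \dots, d\}} (-1)^{|B|} \int_{F_B} \operatorname{vol}_d(\operatorname{rect}(0, x))\,\frac{\partial^{|B|} g}{\partial x^B}(x)\,dx. \tag{5.4.13}$$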


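The formulas (5.4.12) and (5.4.13) mean that the integration error for $g$ is

$$\begin{aligned}
\int_{[0,1]^d} g(x)\,dx - \frac{1}{N}\sum_{j=1}^{N} g(x_j)
&= \sum_{B \subseteq \{1, 2, \dots, d\}} (-1)^{|B|} \int_{F_B} \operatorname{vol}_d(\operatorname{rect}(0, x))\,\frac{\partial^{|B|} g}{\partial x^B}(x)\,dx\\
&\quad - \sum_{B \subseteq \{1, 2, \dots, d\}} (-1)^{|B|} \int_{F_B} \frac{1}{N}\sum_{j=1}^{N} H(x - x_j)\,\frac{\partial^{|B|} g}{\partial x^B}(x)\,dx\\
&= \sum_{B \subseteq \{1, 2, \dots, d\}} (-1)^{|B|} \int_{F_B} \Big[\operatorname{vol}_d(\operatorname{rect}(0, x)) - \frac{1}{N}\sum_{j=1}^{N} H(x - x_j)\Big]\,\frac{\partial^{|B|} g}{\partial x^B}(x)\,dx.
\end{aligned}$$

The quantity $\sum_{j=1}^{N} H(x - x_j)$ is the number of $j$ where $x \ge x_j$; that is, $\sum_{j=1}^{N} H(x - x_j) = |\{i \mid x_i \in \operatorname{rect}(0, x)\}|$. The discrepancy

$$D_N^* = \max_{x \in [0,1]^d} \Big|\operatorname{vol}_d(\operatorname{rect}(0, x)) - \frac{1}{N}\,|\{i \mid x_i \in \operatorname{rect}(0, x)\}|\Big|$$

can now be used to bound the integration error. The term with $B = \emptyset$ vanishes, since $F_\emptyset$ is the single point $x = (1, 1, \dots, 1)$, where the bracketed factor is $1 - N/N = 0$. Therefore

$$\Big|\int_{[0,1]^d} g(x)\,dx - \frac{1}{N}\sum_{j=1}^{N} g(x_j)\Big| \le D_N^* \sum_{\emptyset \ne B \subseteq \{1, 2, \dots, d\}} \int_{F_B} \Big|\frac{\partial^{|B|} g}{\partial x^B}(x)\Big|\,dx.$$

It can be shown that $\sum_{\emptyset \ne B \subseteq \{1, 2, \dots, d\}} \int_{F_B} |\partial^{|B|} g/\partial x^B(x)|\,dx$ is bounded by $V_d[g]$ in (5.4.11), giving (5.4.10), as we wanted.

The bound

$$e_N[g] = \Big|\int_{[0,1]^d} g(x)\,dx - \frac{1}{N}\sum_{j=1}^{N} g(x_j)\Big| \le D_N^*\,V_d[g]$$

on the integration error splits the error into a part that depends only on the points $x_1, x_2, \dots, x_N$ and a part that depends only on the regularity of $g$. In practice, the value of $V_d[g]$ tends to give great overestimates of the error, while for Halton and Sobol’ points, the integration error seems to behave like $O(D_N^*)$ as $N \to \infty$ [47, p. 26]. So it is natural to focus on finding sequences $x_1, x_2, \dots$ where the discrepancy decreases rapidly. Both Halton and Sobol’ sequences in $\mathbb{R}^d$ have $D_N^* = O(N^{-1}(\log N)^d)$ as $N \to \infty$, giving $e_N[g] = O(N^{-1}(\log N)^d)$ as $N \to \infty$. On the other hand, if the points $x_j$ are sampled randomly and independently from a uniform distribution on $[0, 1]^d$,

$$\mathbb{E}\big[e_N[g]^2\big]^{1/2} = \Big[\int_{[0,1]^d} (g(x) - \bar g)^2\,dx\Big]^{1/2} N^{-1/2},$$

where $\bar g = \int_{[0,1]^d} g(x)\,dx$ [47, (5.10) and (5.11)].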

Halton sequences [189, pp. 29-33] are generated by using the one-dimensional van der Corput sequences: given a base $b$ and positive integer $n$, we write $n = (h_p h_{p-1} \cdots h_1 h_0)_b = h_p b^p + h_{p-1} b^{p-1} + \cdots + h_1 b + h_0$ where the integers satisfy $0 \le h_j < b$ for all $j$. Then the $n$th van der Corput number in base $b$ is

$$\operatorname{vdc}(n, b) = (0.h_0 h_1 \cdots h_{p-1} h_p)_b = h_0 b^{-1} + h_1 b^{-2} + \cdots + h_{p-1} b^{-p} + h_p b^{-p-1}. \tag{5.4.14}$$

Halton sequences in $[0, 1]^d$ use $d$ different bases $b_1, b_2, \dots, b_d$ and

$$\operatorname{halton}(n, [b_1, b_2, \dots, b_d]) = (\operatorname{vdc}(n, b_1), \operatorname{vdc}(n, b_2), \dots, \operatorname{vdc}(n, b_d)). \tag{5.4.15}$$

Correlations between the different components of Halton sequences, especially near the beginning of the sequences, have been observed even if the bases $b_1, b_2, \dots$ are distinct primes. For this reason, initial segments of a Halton sequence are often dropped, and after this, only every $m$th element in the Halton sequence is used.
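Formulas (5.4.14) and (5.4.15) translate directly into code. The following is a minimal sketch; the function names `vdc` and `halton` mirror the notation above, and the choice of bases 2 and 3 in the demonstration is an illustrative assumption. It does not drop initial segments or subsample the sequence.

```python
def vdc(n, b):
    """n-th van der Corput number in base b, following (5.4.14)."""
    value = 0.0
    scale = 1.0 / b
    while n > 0:
        n, digit = divmod(n, b)   # digit is h_0, then h_1, ...
        value += digit * scale    # adds h_j * b^(-j-1)
        scale /= b
    return value

def halton(n, bases):
    """n-th Halton point for the given sequence of bases, following (5.4.15)."""
    return tuple(vdc(n, b) for b in bases)

for n in range(1, 9):
    print(n, halton(n, (2, 3)))
```

In base 2 the first few values are 1/2, 1/4, 3/4, 1/8, showing how each new point falls into the largest remaining gap.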

Sobol’ sequences [235] of points $x_k \in [0, 1]^d$ are generated separately through sequences for each component $j$ as follows. The basic algorithm is outlined in [29]. For each $j$ we designate a distinct irreducible polynomial $p_j(z) = \sum_{\ell=0}^{r} a_{j,\ell}\,z^{r-\ell} \pmod 2$, where $r = \deg p_j = s_j$ with $a_{j,0} = 1$. We define $a \oplus b$ for integers $a$ and $b$ in terms of bits: $\operatorname{bit}(a \oplus b, i) = \operatorname{bit}(a, i) + \operatorname{bit}(b, i) \pmod 2$. This can be extended to dyadic fractions ($u/2^q$ with $u, q \in \mathbb{Z}$) by allowing $\operatorname{bit}(a, i)$ to have negative input $i$. Note that irreducibility implies that $p_j(0) = a_{j,s_j} \ne 0$, so that $a_{j,s_j} = 1 \pmod 2$. The $j$th component of $x_k$ is given by

$$(x_k)_j = (\ell_0\,v_{j,0}) \oplus (\ell_1\,v_{j,1}) \oplus \cdots \oplus (\ell_m\,v_{j,m}) \tag{5.4.16}$$

where $k = (\ell_m \ell_{m-1} \cdots \ell_1 \ell_0)_2$ is the binary representation of $k$. The quantities $v_{j,\ell}$ are binary fractions given by $v_{j,\ell} = m_{j,\ell}/2^{\ell+1}$; the integers $m_{j,\ell}$ are given by the recurrence

$$m_{j,\ell} = (2 a_{j,1}\,m_{j,\ell-1}) \oplus (2^2 a_{j,2}\,m_{j,\ell-2}) \oplus \cdots \oplus (2^r a_{j,r}\,m_{j,\ell-r}) \oplus m_{j,\ell-r} \tag{5.4.17}$$

where $r = s_j$. Note that the dyadic fractions $v_{j,\ell}$ satisfy the recurrence

$$v_{j,\ell} = (a_{j,1}\,v_{j,\ell-1}) \oplus (a_{j,2}\,v_{j,\ell-2}) \oplus \cdots \oplus (a_{j,r}\,v_{j,\ell-r}) \oplus (v_{j,\ell-r}/2^r). \tag{5.4.18}$$

Sobol’ sequences can be more efficiently computed [6] using Gray codes. Gray codes are a way of generating the numbers $0, 1, 2, \dots, 2^m - 1$ where consecutive numbers in the sequence differ by exactly one bit. For example, for $m = 3$, the Gray code is $0 = (000)_2$, $1 = (001)_2$, $3 = (011)_2$, $2 = (010)_2$, $6 = (110)_2$, $7 = (111)_2$, $5 = (101)_2$, $4 = (100)_2$. A simple formula for the Gray code of $n$ is $G(n) = n \oplus \lfloor n/2 \rfloor$ where $\lfloor x \rfloor$ is the largest integer $\le x$. Note that if $n$ is a non-negative integer, then $\operatorname{bit}(\lfloor n/2 \rfloor, i) = \operatorname{bit}(n, i + 1)$; that is, $n \mapsto \lfloor n/2 \rfloor$ shifts the bits of $n$ right by one place. By re-ordering the Sobol’ points using the Gray code, we obtain a more efficient implementation:

$$x_{j,G(k+1)} = x_{j,G(k)} \oplus v_{j,c(k)}$$

where $c(k)$ is the index of the only bit where $G(k + 1)$ and $G(k)$ differ.
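The Gray code formula and the bit index $c(k)$ are easy to check in code. This is a small sketch; the function names `gray` and `changed_bit` are chosen here for illustration and are not from the text.

```python
def gray(n):
    """Gray code G(n) = n XOR floor(n/2)."""
    return n ^ (n >> 1)

def changed_bit(k):
    """Index c(k) of the single bit in which G(k+1) and G(k) differ."""
    diff = gray(k) ^ gray(k + 1)   # exactly one bit is set
    return diff.bit_length() - 1

print([gray(n) for n in range(8)])         # Gray code ordering for m = 3
print([changed_bit(k) for k in range(7)])  # bit flipped at each step
```

The first list reproduces the ordering 0, 1, 3, 2, 6, 7, 5, 4 given above. Each step of the re-ordered Sobol’ recurrence then needs only one $\oplus$ operation with $v_{j,c(k)}$.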

Provided we choose the initial values $m_{j,\ell}$ for $0 \le \ell \le s_j - 1$ to be odd, then every $m_{j,\ell}$ is odd. Also we require that $m_{j,\ell} < 2^{\ell+1}$. This ensures that there are no repetitions in the numbers $x_{j,k}$ as $k = 0, 1, 2, \dots$.

Both Halton and Sobol’ points $x_k \in [0, 1]^d$ have the property that the discrepancy $D_N^* = O(N^{-1}(\log N)^d)$. Other sequences have the same asymptotic order of discrepancy. Faure sequences have smaller discrepancies than either Halton or Sobol’ sequences, but with the same asymptotic order. There is a theorem of Roth [219] that for $d \ge 2$ and any infinite sequence, $D_N^*$ is bounded below by $D_N^* = \Omega(N^{-1}(\log N)^{d/2})$ as $N \to \infty$.

In spite of the good asymptotic properties of the discrepancy $D_N^*$ as $N \to \infty$ for Halton, Sobol’, and Faure sequences, the value of $N$ needed to approach this asymptotic behavior grows exponentially in the dimension $d$. Specifically, we need $N \ge 2^d$ to start seeing better behavior using quasi-Monte Carlo methods than using random or pseudo-random sequences [181]. The effectiveness of quasi-Monte Carlo methods in high dimensions is the subject of the paper by Sloan and Woźniakowski [234]. In this paper, the authors consider families of functions $f(x_1, x_2, \dots, x_d)$ with norm

$$\|f\|_{d,\gamma} = \Big[\sum_{U \subseteq \{1, 2, \dots, d\}} \gamma_U^{-1} \int_{[0,1]^{|U|}} \Big|\frac{\partial^{|U|} f}{\partial x_U}(x)\Big|^2\,dx_U\Big]^{1/2} \quad\text{where}$$

$$\gamma_U = \prod_{j \in U} \gamma_j, \quad x_U = [x_j \mid j \in U], \quad\text{with } x_k = 1 \text{ if } k \notin U.$$

Thus $\gamma_j$ represents the roughness of $f$’s dependence on $x_j$. If $s_d(\gamma) := \sum_{j=1}^{d} \gamma_j$ is bounded as $d \to \infty$, the number of function evaluations by a quasi-Monte Carlo method needed for an error of $\le \epsilon$ is independent of $d$ and polynomial in $\epsilon^{-1}$; if $s_d(\gamma)/\ln d$ is bounded as $d \to \infty$, the number of function evaluations by a quasi-Monte Carlo method needed for an error of $\le \epsilon$ is polynomial in $d$ and $\epsilon^{-1}$. Otherwise the bounds are exponential in $d$, no matter the quasi-Monte Carlo method used. Since these are bounds, they do not cover all possible cases in which quasi-Monte Carlo methods are successful, but the results of [234] certainly give good guidance as to when quasi-Monte Carlo methods are most successful.


Exercises. 


(1) Try out the Buffon needle problem of Section 7.4.1.3 with the length of the needle $\ell$ equal to the spacing $s$ between the lines. Use $N = 2^k$ samples from a pseudo-random number generator for $k = 1, 2, \dots, 20$, and plot the error against $N$. What is the empirical estimate of the error in the form error $\approx C\,N^{-\alpha}$?

(2) Repeat the previous Exercise using Halton numbers instead of using a built-in pseudo-random number generator.

(3) Repeat the Buffon needle problem of Exercise 1, but now using antithetic variates (5.4.6) in the angle at which the needle lies to reduce the variance.

(4) A generalization of antithetic variates (5.4.6) for a non-symmetric probability density function $p(x)$ in one dimension is to use the cumulative distribution function $F(x) = \int_{-\infty}^{x} p(t)\,dt$. At each step generate a random sample $U$ from a uniform distribution on $[0, 1]$, and then compute $\tfrac{1}{2}(f(F^{-1}(U)) + f(F^{-1}(1 - U)))$ instead of $\tfrac{1}{2}(f(X) + f(-X))$ for the symmetric case. Show that $F^{-1}(U)$ and $F^{-1}(1 - U)$ both have probability density function $p(x)$. Apply this version of antithetic variables for variance reduction to estimating $\mathbb{E}[X]$ where $X$ has
probability density function p(x) = xe for x > 0. Use n antithetic pairs of 

samples for $n = 2^k$, $k = 1, 2, \dots, 20$. Plot the error in the estimate for $\mathbb{E}[X]$ against $n$ using a log–log plot. Compare with estimates without using antithetic variables.

(5) Consider the problem of estimating $\int_{\mathbb{R}^d} p(x)\,f(x)\,dx$ where $p(x)$ is the probability density function for a Gaussian distribution with mean $\mu$ and variance–covariance matrix $V$: $p(x) = (2\pi)^{-d/2}(\det V)^{-1/2}\exp(-\tfrac{1}{2}(x - \mu)^T V^{-1}(x - \mu))$. Suppose that $V v_j = \lambda_j v_j$ with $\|v_j\|_2 = 1$ gives an orthonormal basis of eigenvectors of $V$. Show that the $2d + 1$ point formula

$$\int_{\mathbb{R}^d} p(x)\,f(x)\,dx \approx f(\mu) + \frac{1}{2}\sum_{j=1}^{d} \lambda_j\,\frac{f(\mu + s_j v_j) - 2 f(\mu) + f(\mu - s_j v_j)}{s_j^2}$$

is exact for cubic functions $f$ provided $s_j \ne 0$ for all $j$. [Hint: Use the change of variables $f(x) = g(y)$ where $x = \mu + [v_1, \dots, v_d]\,y$, and use symmetry to show that the integrals with $g(y) = y_k$, $y_k y_\ell$, $y_k^2 y_\ell$, and $y_k y_\ell y_p$ with $k \ne \ell \ne p \ne k$ are all zero.]

(6) ⚠ Following the previous Exercise, create a method for estimating $\int_{\mathbb{R}^d} p(x)\,f(x)\,dx$ that involves $O(d^2)$ function evaluations and is exact for all polynomials $f$ of degree $\le 5$. [Hint: Focus on $g(y) = y_k^4$ and $y_k^2 y_\ell^2$ for $k \ne \ell$.]

(7) Inthis Exercise, we create a pseudo-random function f (x)= y 4 Qj cos(kj x + 
y;) with a; uniformly distributed over [0, 1/771, kj; € {0, 1, 2, 3} chosen 

independently with probabilities Z, i, i, :, and w; uniformly distributed over 
[0, 27r]. Show that the exact integral is (27r)4 vies a; where J = { J | k;=0}. 
Use quasi-Monte Carlo integration (choose between Halton, Sobol’, and Faure sequences) to estimate $\int_{[0,2\pi]^d} f(x)\,dx$ for a specific function generated this way, for $N = 100$ and $d = 2, 4, 8, 16, 32$. Plot the error against $n$, the number of samples, for $n = 2^k$, $k = 1, 2, \dots, 20$. Use a log–log plot. How does the error behave for increasing $d$?

(8) Repeat the previous Exercise using pseudo-random number generators instead 
of quasi-Monte Carlo generators. How does the error behave for increasing d? 

(9) ⚠ In this Exercise, we aim to generalize the approach of antithetic variables for probability distributions that are invariant under a group of transformations of $\mathbb{R}^d$. In (5.4.6), it is assumed that the probability density function $p(x)$ is symmetric under the transformation $x \mapsto -x$, so that $p(-x) = p(x)$. Instead, suppose that there is a group of transformations $G = \{\gamma\colon \mathbb{R}^d \to \mathbb{R}^d \mid \gamma \in G\}$ under which $p(x)$ is invariant: that is, $p(\gamma(x)) = p(x)$ for each $x$ and $\gamma \in G$. Assuming $G$ is finite, instead of computing $\tfrac{1}{2}(f(X) + f(-X))$ for each sample $X$, we compute $|G|^{-1}\sum_{\gamma \in G} f(\gamma(X))$ for each sample $X$. Further assume that each $\gamma$ is represented by an orthogonal matrix $U_\gamma$: $\gamma(x) = U_\gamma x$. Using the analysis techniques of Section 5.4.1.2 or otherwise, determine the reduction of the variance for smooth $f$.


5.5 Numerical Differentiation 


Computation of derivatives is useful for many applications, such as ordinary and partial differential equations and computing gradients for optimization. Most of this section will be focused on using discrete values $(x_i, f(x_i))$ to estimate $f'(x)$ at some $x$. However, in optimization, we still want to compute gradients of known functions. These functions may be computed by computer code that is complex and where symbolic computation is impracticable. For these situations, there are methods referred to as automatic differentiation, which can transform a given code into a new code that can compute not only the function values as the original code did but also the derivatives in an efficient way.


5.5.1 Discrete Derivative Approximations 


In this section, we consider approximations of the form 


m 


Pen 5 Ye fe+hgp or 


These use discrete function values $f(x + h\,\xi_j)$, often at equally spaced points because of the context in which they are used. The derivation and analysis of these approximations often proceed via interpolation error estimates, or Taylor series with remainder.


5.5.1.1 One-Sided and Centered Differences 


The simplest formula for differentiation, the one-sided difference formula, comes directly from

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \approx \frac{f(x + h) - f(x)}{h} \quad\text{for } h \approx 0. \tag{5.5.1}$$

We can estimate the error most simply using Taylor series with remainder (1.6.1):

$$f(x + h) = f(x) + f'(x)\,h + \tfrac{1}{2} f''(c_h)\,h^2,$$

for some $c_h$ between $x$ and $x + h$. Then

$$\frac{f(x + h) - f(x)}{h} = f'(x) + \tfrac{1}{2} f''(c_h)\,h,$$

and the error is $O(h)$.

Fig. 5.5.1 Results of

The centered difference rule can also be derived using Taylor series with remainder:

$$\begin{aligned}
f(x + h) &= f(x) + f'(x)\,h + \tfrac{1}{2} f''(x)\,h^2 + \tfrac{1}{6} f'''(c_h)\,h^3,\\
f(x - h) &= f(x) - f'(x)\,h + \tfrac{1}{2} f''(x)\,h^2 - \tfrac{1}{6} f'''(d_h)\,h^3, \quad\text{and so}\\
f(x + h) - f(x - h) &= 2 f'(x)\,h + \tfrac{1}{6}\big(f'''(c_h) + f'''(d_h)\big)\,h^3.
\end{aligned}$$

That is,

$$\frac{f(x + h) - f(x - h)}{2h} = f'(x) + \tfrac{1}{12}\big(f'''(c_h) + f'''(d_h)\big)\,h^2 = f'(x) + \tfrac{1}{6} f'''(\hat c_h)\,h^2 \tag{5.5.2}$$

for some $\hat c_h$ between $x - h$ and $x + h$. This gives an error of $O(h^2)$.

If these are used for computing, for example, gradients of a given function, we can choose $h$. Clearly, we get smaller error in our estimates for smaller $h$. So why not make $h$ as small as we can get away with? Why not make $h$ a fraction of unit roundoff? This should give derivatives that are accurate to unit roundoff. Unfortunately, this does not work. The analysis here assumes that the arithmetic is exact.

Figure 5.5.1 shows the error in computing the derivative of $f(x) = e^x/(1 + x)$ at $x = 1$ using (5.5.1) and (5.5.2) for different values of $h$.

For larger values of $h$ (on the right side), we can clearly see steep reductions in the error as $h$ is decreased (going left) until a critical point where further reducing $h$ seems mainly to increase the error, but in an erratic way. The erratic behavior of the error is indicative of roundoff, which is exactly the source of the problem. Using the formal model of floating point arithmetic (1.3.1), we can roughly bound the roundoff error (assuming that $f$ is well-implemented) by $2u\,|f(x)|/h$ for the one-sided formula and $u\,|f(x)|/h$ for the centered difference formula, where $u$ is the unit roundoff for the floating point arithmetic. The total error can then be roughly bounded by

$$\begin{aligned}
&\tfrac{1}{2}\,|f''(x)|\,|h| + \frac{2u\,|f(x)|}{|h|} \quad\text{for the one-sided formula, and}\\
&\tfrac{1}{6}\,|f'''(x)|\,h^2 + \frac{u\,|f(x)|}{|h|} \quad\text{for the centered formula.}
\end{aligned}$$

Minimizing these bounds gives recommended values for $h$:

$$\begin{aligned}
h^* &= 2\,u^{1/2}\,\Big|\frac{f(x)}{f''(x)}\Big|^{1/2} \quad\text{for the one-sided formula, and}\\
h^* &= 3^{1/3}\,u^{1/3}\,\Big|\frac{f(x)}{f'''(x)}\Big|^{1/3} \quad\text{for the centered formula.}
\end{aligned}$$

As a rough order of magnitude estimate, we should take $h^* \approx u^{1/2}$ for the one-sided formula and $h^* \approx u^{1/3}$ for the centered formula. For double precision, these are roughly $10^{-8}$ and $10^{-5}$, respectively, corresponding to the minimum values of the error in Figure 5.5.1.

If we are using one of these methods for estimating gradients $\nabla f(x)$ for $f\colon \mathbb{R}^n \to \mathbb{R}$, we need $n + 1$ function evaluations using the one-sided difference formula and $2n$ function evaluations using the centered difference formula.
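The trade-off between truncation error and roundoff can be reproduced with a short script. This is a sketch in the spirit of Figure 5.5.1, not the code that produced it: the helper names are illustrative, and the exact derivative $f'(x) = x e^x/(1 + x)^2$ is obtained here from the quotient rule.

```python
import math

def f(x):
    return math.exp(x) / (1.0 + x)

def fprime_exact(x):
    # Derivative of e^x / (1 + x) by the quotient rule.
    return math.exp(x) * x / (1.0 + x) ** 2

def one_sided(f, x, h):
    return (f(x + h) - f(x)) / h              # formula (5.5.1)

def centered(f, x, h):
    return (f(x + h) - f(x - h)) / (2.0 * h)  # formula (5.5.2)

x = 1.0
for k in range(1, 16):
    h = 10.0 ** (-k)
    e1 = abs(one_sided(f, x, h) - fprime_exact(x))
    e2 = abs(centered(f, x, h) - fprime_exact(x))
    print(f"h=1e-{k:02d}  one-sided error={e1:.2e}  centered error={e2:.2e}")
```

In double precision the smallest errors should appear near $h \approx 10^{-8}$ for the one-sided formula and near $h \approx 10^{-5}$ for the centered formula, with erratic growth for smaller $h$.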


5.5.1.2 Higher Order Methods and Other Variants 


By expanding by Taylor series with remainder, we can obtain higher order methods. If we start with the centered difference formula

$$\begin{aligned}
\frac{f(x + h) - f(x - h)}{2h} &= f'(x) + \tfrac{1}{6} f'''(x)\,h^2 + O(h^4), \quad\text{then}\\
\frac{f(x + 2h) - f(x - 2h)}{4h} &= f'(x) + \tfrac{4}{6} f'''(x)\,h^2 + O(h^4).
\end{aligned}$$

Combining them with

$$\alpha\,\frac{f(x + h) - f(x - h)}{2h} + (1 - \alpha)\,\frac{f(x + 2h) - f(x - 2h)}{4h} = f'(x) + \big[\alpha + 4(1 - \alpha)\big]\,\tfrac{1}{6} f'''(x)\,h^2 + O(h^4),$$

we get a fourth order method if we choose $\alpha$ so that $[\alpha + 4(1 - \alpha)]/6 = 0$; that is, $\alpha = 4/3$. Then

$$f'(x) = \frac{-f(x + 2h) + 8 f(x + h) - 8 f(x - h) + f(x - 2h)}{12h} + O(h^4). \tag{5.5.3}$$

More generally, we can use interpolation to estimate derivatives: if $p_n$ is the polynomial interpolant of degree $\le n$ with interpolation points $x_0, x_1, \dots, x_n$, then from (4.1.6),

$$f(x) = p_n(x) + f[x, x_0, x_1, \dots, x_n]\,(x - x_0)(x - x_1) \cdots (x - x_n).$$

Taking derivatives,

$$\begin{aligned}
f'(x) &= p_n'(x) + \frac{d}{dx}\Big(f[x, x_0, x_1, \dots, x_n]\,(x - x_0)(x - x_1) \cdots (x - x_n)\Big)\\
&= p_n'(x) + f[x, x, x_0, x_1, \dots, x_n]\,(x - x_0)(x - x_1) \cdots (x - x_n)\\
&\quad + f[x, x_0, x_1, \dots, x_n] \sum_{j=0}^{n} \prod_{k: k \ne j} (x - x_k).
\end{aligned}$$

If we fix the pattern of the interpolation with $x_j = a + h\,\xi_j$ we get

$$\begin{aligned}
f'(a) &= p_n'(a) + f[a, a, a + h\xi_0, \dots, a + h\xi_n]\,h^{n+1}(-1)^{n+1} \prod_{j=0}^{n} \xi_j\\
&\quad + f[a, a + h\xi_0, \dots, a + h\xi_n]\,h^n(-1)^n \sum_{j=0}^{n} \prod_{k: k \ne j} \xi_k.
\end{aligned} \tag{5.5.4}$$

If $\sum_{j=0}^{n} \prod_{k: k \ne j} \xi_k \ne 0$ then we get $O(h^n)$ error in the derivative. However, sometimes we can get an improvement of the order by one if $\sum_{j=0}^{n} \prod_{k: k \ne j} \xi_k = 0$, which is the case for the centered difference method: $\xi_0 = -1$ and $\xi_1 = +1$. It is also the case of the fourth order method (5.5.3) with $\xi = [-2, -1, +1, +2]^T$. It might appear that we can get zero error by making $\prod_{j=0}^{n} \xi_j = 0$ and $\sum_{j=0}^{n} \prod_{k: k \ne j} \xi_k = 0$. To do so, we would need $\xi_j = 0$ for two values of $j$, which means that we would be using a version of Hermite interpolation, interpolating $f'(a)$. This of course requires knowing the value of $f'(a)$, which is what we are trying to compute.


5.5.1.3 Higher Order Derivatives 


Second order derivatives can be estimated numerically, and the best known method for doing this is the three-point stencil:

$$f''(x) \approx \frac{f(x + h) - 2 f(x) + f(x - h)}{h^2}. \tag{5.5.5}$$

We can analyze this method by using Taylor series with remainder: the important point is to make as many low-order terms cancel as possible, except for the desired $f''(x)$:

$$\begin{aligned}
f(x + h) &= f(x) + f'(x)\,h + \tfrac{1}{2} f''(x)\,h^2 + \tfrac{1}{6} f'''(x)\,h^3 + \tfrac{1}{24} f^{(4)}(c_{x,h})\,h^4,\\
f(x - h) &= f(x) - f'(x)\,h + \tfrac{1}{2} f''(x)\,h^2 - \tfrac{1}{6} f'''(x)\,h^3 + \tfrac{1}{24} f^{(4)}(c_{x,-h})\,h^4,
\end{aligned}$$

so

$$\frac{f(x + h) - 2 f(x) + f(x - h)}{h^2} = f''(x) + \tfrac{1}{24}\big[f^{(4)}(c_{x,h}) + f^{(4)}(c_{x,-h})\big]\,h^2,$$

giving an error of $O(h^2)$. We can create a fourth order method by combining this with the $2h$ three point stencil:

$$\frac{f(x + 2h) - 2 f(x) + f(x - 2h)}{(2h)^2} = f''(x) + \tfrac{1}{24}\big[f^{(4)}(c_{x,2h}) + f^{(4)}(c_{x,-2h})\big]\,(2h)^2.$$

The formula we can use is

$$\frac{4}{3}\,\frac{f(x + h) - 2 f(x) + f(x - h)}{h^2} - \frac{1}{3}\,\frac{f(x + 2h) - 2 f(x) + f(x - 2h)}{(2h)^2} = f''(x) + O(h^4),$$

as can be verified by using Taylor series with sixth order remainder. That is,

$$f''(x) = \frac{-f(x + 2h) + 16 f(x + h) - 30 f(x) + 16 f(x - h) - f(x - 2h)}{12\,h^2} + O(h^4). \tag{5.5.6}$$


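For differentiating-the-interpolant approaches, we can estimate the error in second derivatives in a way similar to (5.5.4):

$$\begin{aligned}
f''(x) - p_n''(x) &= \frac{d^2}{dx^2}\Big(f[x, x_0, x_1, \dots, x_n]\,(x - x_0)(x - x_1) \cdots (x - x_n)\Big)\\
&= 2\,f[x, x, x, x_0, x_1, \dots, x_n] \prod_{j=0}^{n} (x - x_j)\\
&\quad + 2\,f[x, x, x_0, x_1, \dots, x_n] \sum_{k=0}^{n} \prod_{j: j \ne k} (x - x_j)\\
&\quad + f[x, x_0, x_1, \dots, x_n] \sum_{k \ne \ell} \prod_{j: j \ne k, \ell} (x - x_j).
\end{aligned}$$

Putting $x_j = a + h\,\xi_j$ for our interpolation pattern,

$$\begin{aligned}
f''(a) - p_n''(a) &= 2\,f[a, a, a, x_0, x_1, \dots, x_n]\,h^{n+1}(-1)^{n+1} \prod_{j=0}^{n} \xi_j\\
&\quad + 2\,f[a, a, x_0, x_1, \dots, x_n]\,h^n(-1)^n \sum_{k=0}^{n} \prod_{j: j \ne k} \xi_j\\
&\quad + f[a, x_0, x_1, \dots, x_n]\,h^{n-1}(-1)^{n-1} \sum_{k \ne \ell} \prod_{j: j \ne k, \ell} \xi_j.
\end{aligned}$$

For the standard three point stencil we have $n = 2$, $\xi_0 = -1$, $\xi_1 = 0$, and $\xi_2 = +1$. In this case $\sum_{k \ne \ell} \prod_{j: j \ne k, \ell} \xi_j = 2(\xi_0 + \xi_1 + \xi_2) = 0$ but

$$\sum_{k=0}^{n} \prod_{j: j \ne k} \xi_j = \xi_0\xi_1 + \xi_0\xi_2 + \xi_1\xi_2 = \xi_0\xi_2 = -1$$

giving $O(h^n) = O(h^2)$.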

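In applications to optimization, we might want to compute second derivatives
to estimate the Hessian matrix $\operatorname{Hess} f(x) = [\partial^2 f/\partial x_k \partial x_\ell\,(x)]_{k,\ell=1}^{n}$ for a function
$f\colon \mathbb{R}^n \to \mathbb{R}$. The formula (5.5.5) can be used to compute the diagonal entries
$\partial^2 f/\partial x_k^2\,(x)$ of the Hessian matrix. The mixed derivatives $\partial^2 f/\partial x_k \partial x_\ell\,(x)$ require a
different approach:

$$ \frac{\partial^2 f}{\partial x_k \partial x_\ell}(x) \approx \frac{1}{4h^2}\Big[ f(x + h e_k + h e_\ell) - f(x - h e_k + h e_\ell) - f(x + h e_k - h e_\ell) + f(x - h e_k - h e_\ell) \Big]. \tag{5.5.7} $$

Using multivariate Taylor series with fourth order remainder, we get cancellation of
all terms inside the square brackets except the $\partial^2 f/\partial x_k \partial x_\ell\,(x)$ term and the $O(h^4)$
terms. This gives an error in the formula (5.5.7) that is $O(h^2)$. Thus the entire $n \times n$
Hessian matrix can be estimated to within $O(h^2)$ using $2n^2 + 1$ function evaluations:
one for $f(x)$, $2n$ for the diagonal entries, and four for each of the $n(n-1)/2$ distinct
mixed derivatives.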


5.5.2 Automatic Differentiation 
If a function is given, even implicitly, by a formula, then we can compute its derivative
symbolically. The usual ways of doing this are:


• “by hand”, where someone who knows calculus performs the symbolic calculations
  by pen/pencil and paper; or
• the formula is first determined (whether “by hand” or by some automated process),
  and then the derivatives are computed by a symbolic mathematics package,
  such as Axiom, Maple, Mathematica, Maxima, or SageMath.


Neither of these approaches is very satisfactory, especially when the function con- 
cerned is not given directly as a formula, but rather implicit in some (perhaps large) 
piece of software. Automatic differentiation, also known as algorithmic differentia- 
tion or computational differentiation, takes a more computational approach based on 
the basic rules of calculus with special attention paid to the chain rule [111, 186]. 


5.5.2.1 Forward Mode 


Forward mode is the simplest approach for automatic differentiation, both concep- 
tually and in practice. This idea is sometimes implemented as dual numbers in pro- 
gramming languages that allow overloaded arithmetic operations and functions. A 
dual number is a pair x = (x.v, x.d) where x.v represents the value of the number, 
and x.d its derivative with respect to some single parameter, say dx /ds. Ordinary 
numbers are treated as constants, and so are represented as (v,0) where v is the 
number. 
Operations on dual numbers x and y can be described as 


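$$\begin{aligned}
x + y &= (x.v + y.v,\; x.d + y.d), \\
x - y &= (x.v - y.v,\; x.d - y.d), \\
x \cdot y &= (x.v \cdot y.v,\; x.v \cdot y.d + x.d \cdot y.v), \\
x / y &= (x.v / y.v,\; (x.d \cdot y.v - x.v \cdot y.d)/(y.v)^2), \\
f(x) &= (f(x.v),\; f'(x.v) \cdot x.d).
\end{aligned}$$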


This can be extended to handle higher order derivatives, such as triple numbers $x =
(x.v, x.d, x.c)$ where $x.d = dx/ds$ and $x.c = d^2x/ds^2$. Then for triple numbers,
for example, the arithmetic rules include

$$\begin{aligned}
x \cdot y &= (x.v \cdot y.v,\; x.v \cdot y.d + x.d \cdot y.v,\; x.v \cdot y.c + 2\, x.d \cdot y.d + x.c \cdot y.v), \\
f(x) &= (f(x.v),\; f'(x.v) \cdot x.d,\; f'(x.v) \cdot x.c + f''(x.v) \cdot (x.d)^2).
\end{aligned}$$
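The dual number rules translate almost directly into operator overloading. The following is a minimal sketch only: the class name `Dual`, the helper `dsin`, and the test function `g` are hypothetical choices for this illustration, and only the operations listed above are implemented.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Dual number (v, d): a value and its derivative with respect to one parameter s."""
    v: float
    d: float = 0.0

    @staticmethod
    def lift(x):
        # Ordinary numbers are treated as constants: (v, 0).
        return x if isinstance(x, Dual) else Dual(float(x), 0.0)

    def __add__(self, other):
        o = Dual.lift(other)
        return Dual(self.v + o.v, self.d + o.d)

    def __sub__(self, other):
        o = Dual.lift(other)
        return Dual(self.v - o.v, self.d - o.d)

    def __mul__(self, other):
        o = Dual.lift(other)
        return Dual(self.v * o.v, self.v * o.d + self.d * o.v)

    def __truediv__(self, other):
        o = Dual.lift(other)
        return Dual(self.v / o.v, (self.d * o.v - self.v * o.d) / o.v**2)

    __radd__ = __add__
    __rmul__ = __mul__

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __rtruediv__(self, other):
        return Dual.lift(other) / self


def dsin(x):
    """Rule f(x) = (f(x.v), f'(x.v) * x.d) applied to f = sin."""
    return Dual(math.sin(x.v), math.cos(x.v) * x.d)


def g(s):
    """Illustrative function g(s) = s sin(s) / (1 + s^2), written once for any number type."""
    return s * dsin(s) / (1.0 + s * s)


s = Dual(0.7, 1.0)  # seeding d = 1 means "differentiate with respect to s"
result = g(s)
print("g(0.7)  =", result.v)
print("g'(0.7) =", result.d)
```

Seeding a different input with `d = 1` gives the derivative with respect to that input, which is why forward mode needs one pass per independent variable.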


The derivatives computed would be exact if the underlying arithmetic were exact.
Thus the only errors in the computed derivatives are due to roundoff error. This does
not guarantee accurate results, but in practice inaccurate results are rare.

Forward mode automatic differentiation is suitable where there is one, or a small
number, of independent variables with respect to which we wish to compute derivatives.
If we wish to compute gradients for many inputs, we need a different method.


5.5.2.2 Reverse Mode


The reverse mode of automatic differentiation is best suited to compute gradients
of a single output function with respect to many inputs. The basic idea has been
rediscovered multiple times that we know of, but the modern approach can be traced
back at least to Seppo Linnainmaa in his PhD thesis that was later published [163].
For this, we need to conceptually flatten the execution of a piece of code so that it is
written as “straight-line code” with branches and loops removed. For example, the
loop


    for i = 1, 2, ..., 4
        x ← f(x)
    end


should be written as it is executed:


    x1 ← f(x0)
    x2 ← f(x1)
    x3 ← f(x2)
    x4 ← f(x3)


The index j in x_j indicates a potentially new value for the variable “x” for each pass
through the body of the loop.

In reverse mode automatic differentiation, this execution path and the values of 
variables along this path must be saved, at least at strategically important points of 
the execution of the original code. This can be represented in a computational graph 
of the execution of the code. Note that in the computational graph, each variable 
must only be assigned a value once. If a value of a variable is over-written, then we 
create a new variable for the computational graph, as shown in the example of the 
loop above. 

The code 


    u ← r · s
    v ← r^s
    x ← φ(u, v)
    y ← x · r


can be represented by the computational graph in Figure 5.5.2. 

We compute the partial derivatives $\partial y/\partial z$ for $z$ each of the variables in the
computational graph as we go back through the computational graph. If we consider
the last node of the computational graph and no prior operations, then
$\partial y/\partial y = 1$ and $\partial y/\partial z = 0$ for all other variables $z$. Consider the last operation:
$y \leftarrow x \cdot r$. Considering this operation alone, we have $\partial y/\partial x = r$ and $\partial y/\partial r = x$.
Now include the second last operation: $x \leftarrow \varphi(u, v)$. Taking this into account,
we find that $\partial y/\partial u = (\partial y/\partial x)(\partial x/\partial u) = (\partial y/\partial x)(\partial \varphi/\partial u\,(u, v))$ while $\partial y/\partial v =
(\partial y/\partial x)(\partial x/\partial v) = (\partial y/\partial x)(\partial \varphi/\partial v\,(u, v))$. We have $\partial y/\partial x$ from before this
operation (it is equal to $r$) so we can compute $\partial y/\partial u$ and $\partial y/\partial v$.

Fig. 5.5.2 Computational graph


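The previous operations are $u \leftarrow r \cdot s$ and $v \leftarrow r^s$. We already have a value
for $\partial y/\partial r$ that is non-zero that does not take these operations into account. This
should not be over-written. Instead, we add $(\partial y/\partial u)(\partial u/\partial r)$ and $(\partial y/\partial v)(\partial v/\partial r)$
to $\partial y/\partial r$. Similarly, $\partial y/\partial s = (\partial y/\partial u)(\partial u/\partial s) + (\partial y/\partial v)(\partial v/\partial s)$.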
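The backward sweep for this four-line example can be written out by hand. The following is a sketch under stated assumptions: the text does not specify the function in the third assignment, so `phi(u, v) = sin(u) + u*v` is a hypothetical stand-in, and the second assignment is taken to be `v = r**s`. The dictionary `bar` holds the accumulated values of $\partial y/\partial z$.

```python
import math


def phi(u, v):
    """Hypothetical stand-in for the function in x <- phi(u, v)."""
    return math.sin(u) + u * v


def dphi_du(u, v):
    return math.cos(u) + v


def dphi_dv(u, v):
    return u


def forward(r, s):
    """Run the straight-line code, keeping every intermediate value."""
    u = r * s
    v = r ** s
    x = phi(u, v)
    y = x * r
    return u, v, x, y


def reverse(r, s):
    """Return y, dy/dr and dy/ds by sweeping backwards through the code."""
    u, v, x, y = forward(r, s)
    bar = {"r": 0.0, "s": 0.0, "u": 0.0, "v": 0.0, "x": 0.0, "y": 1.0}
    # y <- x * r
    bar["x"] += bar["y"] * r
    bar["r"] += bar["y"] * x
    # x <- phi(u, v)
    bar["u"] += bar["x"] * dphi_du(u, v)
    bar["v"] += bar["x"] * dphi_dv(u, v)
    # v <- r ** s  (accumulate into bar["r"]; do not over-write it)
    bar["r"] += bar["v"] * s * r ** (s - 1.0)
    bar["s"] += bar["v"] * v * math.log(r)
    # u <- r * s
    bar["r"] += bar["u"] * s
    bar["s"] += bar["u"] * r
    return y, bar["r"], bar["s"]


y, dy_dr, dy_ds = reverse(1.5, 0.8)
print("y =", y, " dy/dr =", dy_dr, " dy/ds =", dy_ds)
```

Note that every update uses `+=`: a variable such as `r` that feeds several later operations collects one contribution from each of them.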


To see how the general rule works, consider 


    // position t
    w ← w(r, s, u, v)
    // position t+1


Suppose we have values for $\partial y/\partial z$ for every variable $z$ at position $t + 1$ in the code.
At position $t$ we update these partial derivatives via

$$ \frac{\partial y}{\partial z} \leftarrow \frac{\partial y}{\partial z} + \frac{\partial y}{\partial w}\,\frac{\partial w}{\partial z} \tag{5.5.8} $$

for each variable $z$ on the right side of the assignment. To see this more formally,
suppose that starting at position $t + 1$ we have $y = f_{t+1}(r, s, u, v, w)$. Then at position
$t$ we have $y = f_t(r, s, u, v)$, since $w$ is only assigned to once, and so does not
have a value at or before position $t$. Then

$$ y = f_t(r, s, u, v) = f_{t+1}(r, s, u, v, w(r, s, u, v)) \quad\text{and so}\quad \frac{\partial f_t}{\partial r} = \frac{\partial f_{t+1}}{\partial r} + \frac{\partial f_{t+1}}{\partial w}\,\frac{\partial w}{\partial r}. $$

This justifies the update rule

$$ \frac{\partial y}{\partial r} \leftarrow \frac{\partial y}{\partial r} + \frac{\partial y}{\partial w}\,\frac{\partial w}{\partial r}, $$

and so on, for the other variables $s$, $u$, and $v$.

For one output $y$, the number of operations to compute all the derivatives $\partial y/\partial z$
for all input variables $z$ is a modest multiple of the number of operations needed to
compute $y$ from the inputs once: $\operatorname{oper}(\nabla y) = O(\operatorname{oper}(y))$ where $\operatorname{oper}(y)$ is the number
of operations needed to compute $y$ from the inputs. Theoretically, this multiple
is no more than five, provided only standard functions and arithmetic operations are
allowed. In practice, the multiple for straight-line code is perhaps 20 because of
the additional overhead in setting up, tracking, and recording the variables as we go
through the computational graph. The biggest difficulty is that the amount of memory
needed to store intermediate values of computational variables can become large.

As with the forward mode of automatic differentiation, the only errors are due to
roundoff errors.


5.5.2.3 Use of Automatic Differentiation 


Automatic differentiation is such a wonderful technique that there is a tendency to apply it
indiscriminately. Some recent work, such as [131], can seem to promote this point of
view. However, automatic differentiation is not infallible. To illustrate this, consider
using the bisection method to solve $f(x, p) = 0$ for $x$: the solution $x$ is implicitly a
function of $p$: $x = x(p)$. Provided for $p \approx p_0$ we have $f(a, p) < 0$ and $f(b, p) > 0$
for given fixed numbers $a < b$, bisection will give the solution $x(p)$ for $p \approx p_0$.
However, in the bisection algorithm (Algorithm 40), we first look at $c = (a + b)/2$
and evaluate $f(c, p)$ and use the sign of this function value to determine how to
update the endpoints $a$ and $b$. Since $a$ and $b$ are constant, $\partial a/\partial p = \partial b/\partial p = 0$, and
so $\partial c/\partial p = 0$. Continuing through the bisection algorithm we find that the solution
$x^*$ returned has $\partial x^*/\partial p = 0$, which is wrong.
From the Implicit Function Theorem we have

$$ 0 = \frac{\partial f}{\partial x}(x, p)\,\frac{dx}{dp} + \frac{\partial f}{\partial p}(x, p), \quad\text{so}\quad \frac{dx}{dp} = -\left(\frac{\partial f}{\partial p}(x, p)\right) \Big/ \left(\frac{\partial f}{\partial x}(x, p)\right). $$

Once the solution $x(p)$ is found, we can find the derivatives $\partial f/\partial p$ and $\partial f/\partial x$
using automatic differentiation. We can then compute $dx/dp$ using the above formula,
regardless of how $x(p)$ is computed. In a multivariate setting, the computation
of derivatives of the solution $x(p)$ of equations $f(x, p) = 0$ with respect
to a parameter $p$ will involve solving a linear system of equations: $\nabla_p x(p) =
-\nabla_x f(x, p)^{-1}\, \nabla_p f(x, p)$.
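The failure and its remedy can both be seen in a small sketch. The equation $f(x, p) = x^2 - p$ on $[a, b] = [0, 2]$ is an assumption chosen for illustration (so $x(p) = \sqrt{p}$), the function names are hypothetical, and the partial derivatives are written by hand where automatic differentiation would normally supply them.

```python
import math


def f(x, p):
    """Illustrative equation f(x, p) = x^2 - p, whose positive root is sqrt(p)."""
    return x * x - p


def df_dx(x, p):
    return 2.0 * x


def df_dp(x, p):
    return -1.0


def bisect_with_derivative(a, b, p, tol=1e-12):
    """Bisection that also propagates d/dp of the endpoints, as forward mode would."""
    a_dp = b_dp = 0.0  # a and b are constants, so their derivatives are zero
    while b - a > tol:
        c, c_dp = 0.5 * (a + b), 0.5 * (a_dp + b_dp)
        if f(c, p) > 0.0:
            b, b_dp = c, c_dp
        else:
            a, a_dp = c, c_dp
    return 0.5 * (a + b), 0.5 * (a_dp + b_dp)


p = 2.0
x_star, naive_dx_dp = bisect_with_derivative(0.0, 2.0, p)
implicit_dx_dp = -df_dp(x_star, p) / df_dx(x_star, p)

print("x(p)                  =", x_star)
print("differentiated solver =", naive_dx_dp)     # identically zero
print("implicit function thm =", implicit_dx_dp)  # close to 1 / (2 sqrt(p))
print("exact dx/dp           =", 0.5 / math.sqrt(p))
```

The sign test `f(c, p) > 0` depends on `p`, but that dependence is through a branch, which carries no derivative information.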

Automatic differentiation is also heavily used in machine learning and neural 
networks. The main neural network training algorithm backpropagation is essentially 
an application of the main ideas of automatic differentiation [18] combined with a 
version of gradient descent. 


If gradients $\nabla f(x)$ can be computed in $O(\operatorname{oper}(f(x)))$ operations, what about
second derivatives? Can we compute $\operatorname{Hess} f(x)$ in $O(\operatorname{oper}(f(x)))$ operations? The
answer is no. Take, for example, the function $f(x) = (x^T x)^2$. The computation of
$f(x)$ only requires $\operatorname{oper}(f(x)) = 2n + 1$ arithmetic operations. Then

$$\begin{aligned}
\nabla f(x) &= 4 (x^T x)\, x, \\
\operatorname{Hess} f(x) &= 4 (x^T x)\, I + 8\, x x^T.
\end{aligned}$$

For general $x$, $\operatorname{Hess} f(x)$ has $n^2$ non-zero entries ($n(n + 1)/2$ independent entries),
so we cannot expect to “compute” $\operatorname{Hess} f(x)$ in $O(n)$ operations.
However, we can compute
However, we can compute 


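$$\begin{aligned}
\operatorname{Hess} f(x)\, d &= \left[ 4 (x^T x)\, I + 8\, x x^T \right] d \\
&= 4 (x^T x)\, d + 8\, x\, (x^T d)
\end{aligned}$$

in just $7n + 2 = O(n)$ arithmetic operations. In general, we can compute $\operatorname{Hess} f(x)\, d$
in $O(\operatorname{oper}(f(x)))$ operations. We can do this by applying the forward mode to compute

$$ \frac{d}{ds} \nabla f(x + s d)\Big|_{s=0} = \operatorname{Hess} f(x)\, d \tag{5.5.9} $$

where we use the reverse mode for computing $\nabla f(z)$.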
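For the example $f(x) = (x^T x)^2$ the Hessian-vector product can be checked against the identity (5.5.9). In the sketch below the derivative with respect to $s$ is approximated by a centered difference rather than by forward mode automatic differentiation, purely to keep the example short; the function names and the test vectors are assumptions.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def grad_f(x):
    """Gradient of f(x) = (x^T x)^2, namely 4 (x^T x) x."""
    c = 4.0 * dot(x, x)
    return [c * xi for xi in x]


def hess_vec(x, d):
    """Hess f(x) d = 4 (x^T x) d + 8 x (x^T d), using O(n) operations."""
    xx, xd = dot(x, x), dot(x, d)
    return [4.0 * xx * di + 8.0 * xd * xi for xi, di in zip(x, d)]


def hess_vec_by_difference(x, d, s=1e-5):
    """Centered difference in s of grad f(x + s d), approximating (5.5.9)."""
    g_plus = grad_f([xi + s * di for xi, di in zip(x, d)])
    g_minus = grad_f([xi - s * di for xi, di in zip(x, d)])
    return [(gp - gm) / (2.0 * s) for gp, gm in zip(g_plus, g_minus)]


x = [1.0, 2.0, 3.0]
d = [0.5, -1.0, 2.0]
print("closed form:      ", hess_vec(x, d))
print("difference approx:", hess_vec_by_difference(x, d))
```

Neither routine ever forms the $n \times n$ Hessian matrix.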
Exercises.


(1) If $f\colon \mathbb{R}^d \to \mathbb{R}$ is smooth, how many function evaluations are needed to estimate
    $\nabla f(x)$ using the one-sided difference formula (5.5.1)? How many function
    evaluations are needed if the centered difference formula (5.5.2) is used?

(2) Use one-sided (5.5.1) and centered differences (5.5.2) to estimate $f'(\pi/4)$ for
    $f(x) = (\cos x)/\sqrt{1 + x}$ using spacing $h$. Plot the errors against $h$ for $h = 4^{-k}$,
    $k = 1, 2, \ldots, 10$.

(3) Repeat the previous Exercise using the 4th order symmetric difference method
    which uses the function values $f(x \pm h)$ and $f(x \pm 2h)$.

(4) Develop a formula to compute $\partial^2 f/\partial x\, \partial y\,(x, y)$ using the function values
    $f(x, y)$ and $f(x \pm h, y \pm h)$ with all possible choices of signs.

(5) Extend the previous Exercise into a method for computing the Hessian matrix
    $\operatorname{Hess} f(x) = [\partial^2 f/\partial x_k \partial x_\ell\,(x)]_{k,\ell=1}^{n}$ of 2nd order derivatives. How many
    function evaluations are needed?

(6) In many machine learning systems, there are weight matrices to be optimized.
    Suppose that we seek to minimize $h(W, b) := g(W u + b)$ given $u$ over all
    possible values of $W$ and $b$. We define $\nabla_W h(W, b)$ to be the matrix of partial
    derivatives $\partial h/\partial w_{k\ell}\,(W, b)$. Compute $\nabla_W h(W, b)$ and $\nabla_b h(W, b)$ in terms of
    $\nabla g(W u + b)$ and $u$. [Hint: First do this in terms of the entries of $W$, then
    combine these expressions into a concise formula using matrix-vector operations.]

(7) Given a collection of data points $(x_i, y_i)$, $i = 1, 2, \ldots, N$, we wish to compute
    the gradient of $\varphi(w) := (2N)^{-1} \sum_{i=1}^{N} (y_i - g(x_i; w))^2$. Let $m$ be the dimension
    of $w$. Given that there is efficient code for computing $\nabla_w g(x; w)$, give an
    efficient method for computing $\nabla \varphi(w)$. Given that there is efficient code for
    computing $\operatorname{Hess}_w g(x; w)\, d$ given $x$, $w$, and $d \in \mathbb{R}^m$ using only $O(\operatorname{oper}(g) + m)$
    operations, create a method to compute $\operatorname{Hess} \varphi(w)\, d$ given $w$ and $d$ that
    uses only $O(N \operatorname{oper}(g) + N m)$ operations.

(8) Backpropagation is a well-known technique for neural networks for computing
    gradients for these functions. Suppose that the cost for a given output $u$ of a
    particular layer of the network is $\varphi(u)$. Now suppose that $u = \operatorname{map}(\sigma, W v + b)$
    where $y = \operatorname{map}(f, x)$ is given by the formula $y_i = f(x_i)$. Starting with $\nabla \varphi(u)$
    for $u = \operatorname{map}(\sigma, W v + b)$, give efficient formulas for $\nabla_v \varphi(\operatorname{map}(\sigma, W v + b))$
    as well as $\nabla_W \varphi(\operatorname{map}(\sigma, W v + b))$ (see Exercise 6 for matrix derivatives) and
    $\nabla_b \varphi(\operatorname{map}(\sigma, W v + b))$ using $\sigma'$.

(9) Use Exercises 7 and 8 to implement a backpropagation method for computing
    the gradient of $(y - \hat{y})^2$ with respect to $z$, $c$, and each $W^{(j)}$ and $b^{(j)}$, $j =
    1, 2, \ldots, m$, where

    $$ y = z^T \operatorname{map}\big(\sigma, W^{(m)} \operatorname{map}(\sigma, W^{(m-1)} \operatorname{map}(\cdots W^{(1)} x + b^{(1)} \cdots) + b^{(m-1)}) + b^{(m)}\big) + c. $$

(10) Given methods for computing gradients of $\varphi(u, v)$ and the Jacobian matrices
    of $F(u, v)$, given that $\nabla_u F(u, v)$ is invertible, give a method for computing
    $\nabla \psi(v)$ where $\psi(v) = \varphi(u, v)$ and $F(u, v) = 0$. Assume that for each $v$ there
    is exactly one $u$ satisfying $F(u, v) = 0$. Explain why this avoids trying to
    “differentiate the solver” where a nonlinear equation solver is used for computing
    $\psi(v)$. Show that trying to “differentiate the solver” does not work when applied
    to the bisection method.


Chapter 6
Differential Equations


Differential equations provide a language for representing many processes in nature, 
technology, and society. Ordinary differential equations have one independent vari- 
able on which all others depend. Usually this independent variable is time, although 
it may be position along a rod or string. Typically in these situations, the starting 
position or state is known at a particular time, and we wish to forecast how that will 
change with time. These are initial value problems. In other cases, partial values 
are known at the start and at the end, and the differential equation describes how 
things change between the start and end times. These are known as boundary value 
problems. 

Partial differential equations have several independent variables, usually repre- 
senting spatial co-ordinates, and sometimes including time as an additional indepen- 
dent variable. Where a partial differential equation has spatial as well as temporal 
independent variables, often the problem is discretized with respect to the spatial 
variables, leaving an ordinary differential equation remaining for the spatially dis- 
cretized variables. 


6.1 Ordinary Differential Equations — Initial Value 
Problems 


The form of problem we consider here is, given $f\colon \mathbb{R} \times \mathbb{R}^n \to \mathbb{R}^n$ and $x_0 \in \mathbb{R}^n$, to
find the function $x(\cdot)$ where

$$ \frac{dx}{dt} = f(t, x), \qquad x(t_0) = x_0. \tag{6.1.1} $$

This is the general form of an initial value problem (IVP). By “finding the function
$x(\cdot)$” computationally we actually mean “finding $x_k \approx x(t_k)$, $k = 0, 1, 2, 3, \ldots$ for
$t_k$’s sufficiently close together”. Interpolation can then be used to give good
approximations for $x(t)$ where $t$ is between the $t_k$’s.


6.1.1 Basic Theory 


We start with an equivalent expression of the initial value problem (6.1.1): 


$$ x(t) = x_0 + \int_{t_0}^{t} f(s, x(s))\, ds \quad\text{for all } t. \tag{6.1.2} $$


Peano proved the existence and uniqueness of solutions to the initial value problem
using a fixed point iteration [200] named in his honor:

$$ x_{k+1}(t) = x_0 + \int_{t_0}^{t} f(s, x_k(s))\, ds \quad\text{for all } t, \text{ for } k = 0, 1, 2, \ldots, \tag{6.1.3} $$

with $x_0(t) = x_0$ for all $t$. To show that the iteration (6.1.3) is well defined and
converges, we need to make some assumptions about the right-hand side function $f$.
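The iteration (6.1.3) can be carried out exactly when the iterates are polynomials. The sketch below assumes the illustrative problem $dx/dt = x$, $x(0) = 1$ (so $f(t, x) = x$ and $t_0 = 0$) and represents each iterate by its list of coefficients; the function name `peano_step` is hypothetical. Each pass adds one more term of the Taylor series of $e^t$.

```python
import math
from fractions import Fraction


def peano_step(coeffs, x0=Fraction(1)):
    """One pass of (6.1.3) for f(t, x) = x and t0 = 0.

    coeffs[j] is the coefficient of t^j in x_k(t); the result holds the
    coefficients of x_{k+1}(t) = x0 + integral from 0 to t of x_k(s) ds.
    """
    integrated = [c / (j + 1) for j, c in enumerate(coeffs)]  # integrate term by term
    return [x0] + integrated


iterate = [Fraction(1)]  # x_0(t) = x0 = 1 for all t
for k in range(5):
    iterate = peano_step(iterate)

print([str(c) for c in iterate])         # coefficients 1/j! for j = 0, ..., 5
value_at_1 = float(sum(iterate))         # x_5(1)
print(value_at_1, "vs e =", math.e)
```

For this linear problem the iterates converge for every $t$; in general the contraction argument only gives convergence on a short interval at a time.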

More specifically we assume that $f(t, x)$ is continuous in $(t, x)$ and Lipschitz
continuous in $x$: there must be a constant $L$ where

$$ \| f(t, u) - f(t, v) \| \le L\, \| u - v \| \quad\text{for all } t, u, \text{ and } v. \tag{6.1.4} $$

Carathéodory extended Peano’s existence theorem to allow for $f(t, x)$ continuous
in $x$ and measurable in $t$ with a bound $\| f(t, x) \| \le m(t)\, \varphi(\| x \|)$ with $m(t) \ge 0$
integrable in $t$ over $[t_0, T]$, $\varphi$ continuous, and $\int^{\infty} dr/\varphi(r) = \infty$. Uniqueness holds
if the Lipschitz continuity condition (6.1.4) holds with an integrable function $L(t)$:

$$ \| f(t, u) - f(t, v) \| \le L(t)\, \| u - v \| \quad\text{for all } t, u, v. \tag{6.1.5} $$

We will focus on the case where $f(t, x)$ is continuous in $t$ and Lipschitz in $x$ (6.1.4)
since numerical estimation of integrals of general measurable functions is essentially
impossible.


Theorem 6.1 Suppose $f\colon \mathbb{R} \times \mathbb{R}^n \to \mathbb{R}^n$ is continuous and (6.1.4) holds. Then the
initial value problem (6.1.1) has a unique solution $x(\cdot)$.


Proof We use the Peano iteration (6.1.3) to show that the integral form
(6.1.2) of (6.1.1) has a unique solution. To do that we show that the iteration
(6.1.3) is a contraction mapping (Theorem 3.3) on the space of continuous functions
$[t_0, t_0 + \delta] \to \mathbb{R}^n$ for $\delta = 1/(2L)$. This establishes the existence and uniqueness of
the solution $x\colon [t_0, t_0 + \delta] \to \mathbb{R}^n$. To show existence and uniqueness beyond this,
let $t_1 = t_0 + \delta$ and $x_1 = x(t_0 + \delta)$. Then applying the argument to

$$ \frac{d\hat{x}}{dt} = f(t, \hat{x}), \qquad \hat{x}(t_1) = x_1 $$

we obtain a unique solution $\hat{x}\colon [t_1, t_1 + \delta] \to \mathbb{R}^n$, and we extend our solution
$x(\cdot)$ to $[t_0, t_1 + \delta]$ by $x(t) = \hat{x}(t)$ for $t_1 \le t \le t_1 + \delta$. We can continue in this way
to obtain a unique solution $x\colon [t_0, T] \to \mathbb{R}^n$ for any $T > t_0$. By reversing the time
direction, we can show existence and uniqueness on any time interval $[T, t_0]$ for any
$T < t_0$.
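To show that (6.1.3) is a contraction mapping using the norm

$$ \| u \|_\infty = \max_{t \in [t_0, t_0 + \delta]} \| u(t) \|, $$

we note that

$$\begin{aligned}
&\max_{t \in [t_0, t_0 + \delta]} \left\| \left( x_0 + \int_{t_0}^{t} f(s, u(s))\, ds \right) - \left( x_0 + \int_{t_0}^{t} f(s, v(s))\, ds \right) \right\| \\
&\qquad = \max_{t \in [t_0, t_0 + \delta]} \left\| \int_{t_0}^{t} f(s, u(s))\, ds - \int_{t_0}^{t} f(s, v(s))\, ds \right\| \\
&\qquad = \max_{t \in [t_0, t_0 + \delta]} \left\| \int_{t_0}^{t} \left[ f(s, u(s)) - f(s, v(s)) \right] ds \right\| \\
&\qquad \le \max_{t \in [t_0, t_0 + \delta]} \int_{t_0}^{t} \left\| f(s, u(s)) - f(s, v(s)) \right\| ds \\
&\qquad \le \max_{t \in [t_0, t_0 + \delta]} \int_{t_0}^{t} L\, \| u(s) - v(s) \|\, ds \qquad \text{(as $f$ is Lipschitz)} \\
&\qquad \le \max_{t \in [t_0, t_0 + \delta]} \int_{t_0}^{t} L\, \| u - v \|_\infty\, ds \\
&\qquad = \max_{t \in [t_0, t_0 + \delta]} L\, \| u - v \|_\infty\, (t - t_0) = L \delta\, \| u - v \|_\infty \\
&\qquad = \tfrac{1}{2}\, \| u - v \|_\infty.
\end{aligned}$$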

The iteration (6.1.3) is therefore a contraction mapping and so has a unique fixed
point. The remainder of the argument follows as described above, giving existence
and uniqueness of solutions on any interval $[t_0, t_1]$ with $t_1 > t_0$. For $t < t_0$ we can
simply apply the above arguments to the reversed time differential equation
$d\hat{x}/d\tau = -f(-\tau, \hat{x})$ with $\hat{x}(-t_0) = x_0$ for $\hat{x}(\tau) = x(-\tau)$.


If $f$ is differentiable, then

$$ f(t, y) - f(t, x) = \int_0^1 \nabla f(t, x + s(y - x))\,(y - x)\, ds, \quad\text{so} $$

$$ \| f(t, y) - f(t, x) \| \le \int_0^1 \| \nabla f(t, x + s(y - x)) \|\, \| y - x \|\, ds. $$

If $\| \nabla f(t, z) \| \le B$ for all $z$ then $f$ is Lipschitz as

$$ \| f(t, y) - f(t, x) \| \le \int_0^1 B\, \| y - x \|\, ds = B\, \| y - x \|. $$

On the other hand, provided $\nabla f(t, z)$ is continuous in $z$, then for any $x$ there is a
$d \neq 0$ where $\| \nabla f(t, x)\, d \| = \| \nabla f(t, x) \|\, \| d \|$. So if $f$ is Lipschitz continuous with
constant $L$, then

$$ L\, \| d \| = \lim_{s \downarrow 0} \frac{L\, \| s d \|}{s} \ge \lim_{s \downarrow 0} \frac{\| f(t, x + s d) - f(t, x) \|}{s} = \| \nabla f(t, x)\, d \| = \| \nabla f(t, x) \|\, \| d \| $$

so $L$ has to be at least $\| \nabla f(t, x) \|$. So a differentiable function $f$ is Lipschitz if and only
if $\| \nabla f(t, z) \|$ is bounded.


Example 6.2 The condition of Lipschitz continuity is important both for existence
and uniqueness. Regarding existence, consider the initial value problem

$$ \frac{dy}{dt} = 1 + y^2, \qquad y(0) = 0. $$

Using separation of variables we get

$$ \tan^{-1} y = \int \frac{dy}{1 + y^2} = \int dt = t + C, \quad\text{so}\quad y(t) = \tan(t + C). $$

The initial condition $y(0) = 0$ means that we should take $C = 0$. That is, $y(t) = \tan t$.
This is a reasonable solution for $-\pi/2 < t < +\pi/2$ but “blows up” as $t \to \pm\pi/2$,
for example. This is an example of local existence: for $t \approx t_0$ the solution $x(t)$ exists,
but the solution cannot be extended to all $t$.


Note that the function $y \mapsto 1 + y^2$ is not Lipschitz as $(d/dy)(1 + y^2) = 2y$ which
is not bounded as $y \to \pm\infty$.


Example 6.3 Regarding uniqueness, consider the initial value problem

$$ \frac{dy}{dt} = |y|^{1/2}, \qquad y(0) = 0. $$

For $t > 0$ we have to have $y(t) \ge 0$ because the right-hand side is $\ge 0$. Using
separation of variables we have for $t \ge 0$



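$$ 2\, y^{1/2} = \int y^{-1/2}\, dy = \int dt = t + C, \quad\text{so}\quad y(t) = \tfrac{1}{4}(t + C)^2. $$

The initial value $y(0) = 0$ implies $C = 0$, giving $y(t) = t^2/4$. The problem is that
this is not the only solution. Another solution is $y(t) = 0$ for all $t$. In fact, for any
$t^* > 0$ there is the solution

$$ y(t) = \begin{cases} 0, & \text{if } t \le t^*, \\ (t - t^*)^2/4, & \text{if } t \ge t^*. \end{cases} $$

We can see the right-hand side function is not Lipschitz as

$$ \frac{d}{dy}\left( |y|^{1/2} \right) = \tfrac{1}{2}\, y^{-1/2} \quad\text{for } y > 0, $$

which is unbounded as $y \downarrow 0$.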


Remark 6.4 There are equations for which solutions exist and are unique even though
the right-hand side is not Lipschitz. If $f(t, z)$ is continuously differentiable then there is
local existence: consider the modified equation

$$ \frac{dx_R}{dt} = \begin{cases} f(t, x_R), & \text{if } \| x_R \|_2 \le R, \\ f(t, R\, x_R/\| x_R \|_2), & \text{if } \| x_R \|_2 \ge R. \end{cases} \tag{6.1.6} $$

It can be shown that the right-hand side function is Lipschitz with Lipschitz constant
$L = \max_{x:\, \| x \|_2 \le R} \| \nabla f(t, x) \|$ and has the same solutions provided $\| x_R(t) \|_2 \le R$
for all $t$. Thus if $\| x_0 \|_2 < R$, then there is an $\epsilon > 0$ depending only on $\| x_0 \|_2$, $R$
and $\max_{x:\, \| x \|_2 \le R} \| f(t, x) \|_2$ where there is a solution $x(t)$ to $dx/dt = f(t, x(t))$,
$x(t_0) = x_0$ for $t_0 \le t \le t_0 + \epsilon$. If the solution can be guaranteed to remain bounded,
then we can prove existence for all $t$.


Example 6.5 An example is given by the Euler equations for the rotation of a rigid body. If
$\omega$ is the angular velocity vector of a rigid body relative to co-ordinates fixed in the
body, then

$$ \frac{d\omega}{dt} = J^{-1} \left[ \omega \times J \omega + \tau \right] \tag{6.1.7} $$

where $J$ is the moment of inertia matrix, a symmetric positive definite $3 \times 3$
matrix, and $\tau$ is the external torque in fixed-body co-ordinates. The “$\times$” is the
cross product for three dimensional vectors. The right-hand side function $\omega \mapsto
J^{-1}[\omega \times J\omega + \tau]$ is not Lipschitz: it is quadratic in $\omega$ and so the Jacobian matrix
$\nabla_\omega (J^{-1}[\omega \times J\omega + \tau])$ is a non-zero linear function of $\omega$, and so is unbounded as
$\| \omega \| \to \infty$.

On the other hand,

$$ \frac{d}{dt}\left( \tfrac{1}{2}\, \omega^T J \omega \right) = \omega^T J\, \frac{d\omega}{dt} = \omega^T \left[ \omega \times J\omega + \tau \right] = (J\omega)^T \left[ \omega \times \omega \right] + \omega^T \tau = \omega^T \tau, $$

using the rules for cross products that $a^T (b \times c) = b^T (c \times a)$ and $a \times a = 0$. If
$\tau(t) = 0$ for all $t$, then $\tfrac{1}{2}\, \omega^T J \omega$ is constant. Since $J$ is positive definite, this means
that $\omega(t)$ is bounded for all $t$. The Jacobian matrix is bounded on this bounded
set $\{\, \omega \mid \tfrac{1}{2}\, \omega^T J \omega = \text{constant} \,\}$, and so there exists a unique solution to the Euler
equations for all $t$.


6.1.1.1 Gronwall Lemmas: Continuous and Discrete 


A way of showing boundedness of solutions is to use a Gronwall lemma. The original
Gronwall lemma is due to Gronwall [112] and was extended by Bellman [19], LaSalle
[156], and Bihari [22]. These extended results can be summarized in the following
lemma.


Lemma 6.6 (Generalized Gronwall lemma) If

$$ \frac{dr}{dt}(t) \le \varphi(r(t))\, \psi(t) \quad\text{for all } t, \text{ and}\quad r(t_0) = r_0, $$

where $\varphi$ is a continuous positive function and $\psi$ is integrable, then we have
$r(t) \le \rho(t)$ for all $t \ge t_0$, where

$$ \frac{d\rho}{dt} = \varphi(\rho)\, \psi(t), \qquad \rho(t_0) = r_0. $$


Proof Let $G(r) = \int^{r} ds/\varphi(s)$. This is a differentiable increasing function. Then

$$ \frac{d}{dt}\, G(r(t)) = \frac{1}{\varphi(r(t))}\, \frac{dr}{dt}(t) \le \psi(t). $$

Integrating the right-hand side gives

$$ G(r(t)) - G(r_0) \le \int_{t_0}^{t} \psi(s)\, ds = G(\rho(t)) - G(r_0), \quad\text{so} $$

$$ r(t) \le \rho(t) \quad\text{for all } t \ge t_0, $$

as we wanted.

For numerical methods, we have iterations where we want to show that the iterates
are bounded in some way. For this, there are discrete versions of Lemma 6.6 such
as the following.


Lemma 6.7 Suppose that

$$ r_{k+1} \le r_k + \varphi(r_k) \left[ \Psi(t_{k+1}) - \Psi(t_k) \right] \quad\text{for } k = 0, 1, 2, 3, \ldots $$

where $\varphi$ is a continuous positive non-decreasing function and $\Psi' = \psi \ge 0$ is
integrable, with $t_0 < t_1 < t_2 < \cdots$. Then

$$ r_k \le \rho(t_k) \quad\text{where}\quad \frac{d\rho}{dt} = \varphi(\rho)\, \psi(t), \qquad \rho(t_0) = r_0. $$

Proof Let $\tilde{r}(s)$ be the piecewise linear interpolant of $\tilde{r}(\Psi(t_k)) = r_k$. Then $\tilde{r}$ is
absolutely continuous and for $\Psi(t_k) < s < \Psi(t_{k+1})$,

$$ \frac{d\tilde{r}}{ds}(s) = \frac{r_{k+1} - r_k}{\Psi(t_{k+1}) - \Psi(t_k)} \le \varphi(r_k) \le \varphi(\tilde{r}(s)) $$

as $r_k \le \tilde{r}(s)$ for $s \ge \Psi(t_k)$. Note that if $\Psi(t_{k+1}) = \Psi(t_k)$ then $r_{k+1} \le r_k$, and we can
ignore the interval $[t_k, t_{k+1}]$. Applying Lemma 6.6 gives

$$ \tilde{r}(s) \le \tilde{\rho}(s) \quad\text{where}\quad \frac{d\tilde{\rho}}{ds} = \varphi(\tilde{\rho}), \qquad \tilde{\rho}(\Psi(t_0)) = r_0. $$

Note that $\rho(t) = \tilde{\rho}(\Psi(t))$ is the solution to

$$ \frac{d\rho}{dt} = \varphi(\rho)\, \psi(t), \qquad \rho(t_0) = r_0, $$

which gives the desired result.

The discrete Gronwall lemma (Lemma 6.7) is useful for bounding numerical solu- 
tions for differential equations. 


6.1.2 Euler’s Method and Its Analysis 


Euler’s method for the initial value problem

$$ \frac{dx}{dt} = f(t, x), \qquad x(t_0) = x_0 $$

is based on the approximation

$$ \frac{x(t + h) - x(t)}{h} \approx f(t, x(t)). $$

Set $t_k = t_0 + k h$; then the method is

$$ x_{k+1} = x_k + h\, f(t_k, x_k), \qquad k = 0, 1, 2, \ldots \tag{6.1.8} $$

for computing $x_k \approx x(t_k)$.

We expect to get greater accuracy if we reduce the step size $h$, but doing so means
we need to use more steps and more function evaluations. To integrate from $t = a$
to $t = b$ we need $n = \lceil (b - a)/h \rceil$ steps. Note that $\lceil z \rceil$ is the smallest integer $\ge z$,
also known as the ceiling of $z$. Each step contributes to the total error; we hope
that the contribution that each step makes to the total error goes to zero faster than
(constant$/n$).

As an example, take the numerical solutions of $dy/dt = 1 + y^2$ for $y(0) = 0$ over
the interval $[0, 1]$ using Euler’s method. Some are shown in Figure 6.1.1.
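Euler's method (6.1.8) for this example takes only a few lines. The following is a minimal sketch for a scalar problem; the function name `euler` and the chosen step counts are assumptions. Since the exact solution is $y(t) = \tan t$, the error at $t = 1$ can be printed directly, and it should roughly halve each time $h$ is halved.

```python
import math


def euler(f, t0, x0, h, n):
    """Take n steps of Euler's method (6.1.8) with step size h; return x_n."""
    t, x = t0, x0
    for _ in range(n):
        x = x + h * f(t, x)
        t = t + h
    return x


def rhs(t, y):
    """Right-hand side of dy/dt = 1 + y^2."""
    return 1.0 + y * y


exact = math.tan(1.0)
for k in range(3, 11):
    n = 2 ** k
    error = abs(euler(rhs, 0.0, 0.0, 1.0 / n, n) - exact)
    print(f"h = 2^-{k:<2d}  error at t = 1: {error:.3e}")
```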

An important variant of Euler’s method is the implicit Euler method, or backward
Euler method:

$$ x_{k+1} = x_k + h\, f(t_{k+1}, x_{k+1}), \qquad k = 0, 1, 2, \ldots \tag{6.1.9} $$

for computing $x_k \approx x(t_k)$. As with other implicit methods, we have to solve an
equation for $x_{k+1}$, which may be nonlinear. In spite of the computational difficulties
involved with this, the implicit Euler method has some excellent stability properties
that are discussed more in Section 6.1.6.

Fig. 6.1.1 Solutions via Euler’s method for $dy/dt = 1 + y^2$, $y(0) = 0$


6.1.2.1 Error Analysis for Euler’s Method


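To estimate the error, we apply Taylor series with second-order remainder in integral
form to the exact solution:

$$ x(t_{k+1}) = x(t_k) + \frac{dx}{dt}(t_k)\, h + \int_{t_k}^{t_{k+1}} (t_{k+1} - t)\, \frac{d^2 x}{dt^2}(t)\, dt. \tag{6.1.10} $$

The remainder term $\int_{t_k}^{t_{k+1}} (t_{k+1} - t)\, x''(t)\, dt$ is the error we would incur with Euler’s
method if $x_k = x(t_k)$ were exact. It is called the local truncation error (LTE). The
size of the LTE can be bounded by

$$\begin{aligned}
\left\| \int_{t_k}^{t_{k+1}} (t_{k+1} - t)\, \frac{d^2 x}{dt^2}(t)\, dt \right\|
&\le \int_{t_k}^{t_{k+1}} (t_{k+1} - t) \left\| \frac{d^2 x}{dt^2}(t) \right\| dt \\
&\le \int_{t_k}^{t_{k+1}} (t_{k+1} - t)\, dt \; \max_{t_k \le c \le t_{k+1}} \left\| \frac{d^2 x}{dt^2}(c) \right\| \\
&= \frac{1}{2} h^2 \max_{t_k \le c \le t_{k+1}} \left\| \frac{d^2 x}{dt^2}(c) \right\|.
\end{aligned}$$

So we can write the LTE as

$$ \int_{t_k}^{t_{k+1}} (t_{k+1} - t)\, \frac{d^2 x}{dt^2}(t)\, dt = \frac{1}{2} h^2\, \eta_k \quad\text{with}\quad \| \eta_k \| \le \max_{t_k \le c \le t_{k+1}} \left\| \frac{d^2 x}{dt^2}(c) \right\|. $$

Then (6.1.10) can be written as

$$ x(t_{k+1}) = x(t_k) + h\, \frac{dx}{dt}(t_k) + \frac{1}{2} h^2\, \eta_k. $$

Euler’s method is

$$ x_{k+1} = x_k + h\, f(t_k, x_k). $$

Subtracting and using the error $e_j = x(t_j) - x_j$ we get

$$ e_{k+1} = e_k + h \left[ f(t_k, x(t_k)) - f(t_k, x_k) \right] + \frac{1}{2} h^2\, \eta_k. $$

Taking norms and using the usual inequalities for norms we get

$$\begin{aligned}
\| e_{k+1} \| &\le \| e_k \| + h\, \| f(t_k, x(t_k)) - f(t_k, x_k) \| + \frac{1}{2} h^2\, \| \eta_k \| \\
&\le \| e_k \| + h L\, \| x(t_k) - x_k \| + \frac{1}{2} h^2\, \| \eta_k \| \\
&\le (1 + h L)\, \| e_k \| + \frac{1}{2} h^2 M
\end{aligned} \tag{6.1.11}$$

where $M = \max_t \| x''(t) \|$ with the maximum taken over the interval of integration.

Assuming the initial value $x_0 = x(t_0)$ is exactly correct, we get $e_0 = 0$. Then we
can apply (6.1.11) inductively to get a bound for many steps:

$$\begin{aligned}
\| e_k \| &\le (1 + h L)^k\, \| e_0 \| + \sum_{j=0}^{k-1} (1 + h L)^j\, \frac{1}{2} h^2 M \\
&= \frac{(1 + h L)^k - 1}{(1 + h L) - 1}\, \frac{1}{2} h^2 M = \frac{1}{2} h \left[ (1 + h L)^k - 1 \right] \frac{M}{L}.
\end{aligned}$$

Using the bound

$$ 1 + h L \le 1 + h L + \frac{(h L)^2}{2!} + \frac{(h L)^3}{3!} + \cdots = e^{h L} $$

we obtain

$$ \| e_k \| \le \frac{1}{2} h \left[ e^{h L k} - 1 \right] \frac{M}{L} = \frac{1}{2} h \left[ e^{L (t_k - t_0)} - 1 \right] \frac{M}{L}. \tag{6.1.12} $$

Provided we consider integrating the differential equation over a fixed interval $[t_0, T]$,
then we can say that the error is $O(h)$.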

For the differential equation $dy/dt = 1 + y^2$ with $y(0) = 0$, the errors at $t = 1$ for
Euler’s method for different values of $h$ are shown in Figure 6.1.1(b) along with the
errors for some methods discussed in the following sections: Heun’s method and the
standard 4th order Runge-Kutta method. The slope on the log-log plot for Euler’s
method from $h = 2^{-3}$ to $h = 2^{-10}$ is about 0.934, which is entirely consistent with
an error of $O(h)$.

However, integrating a differential equation over a long time period can result in exponential
growth in the error due to the factor $e^{L(t_k - t_0)}$. This exponential growth is obvious in
unstable differential equations such as $dx/dt = \lambda x$ with $\lambda > 0$. In this example, the
size of the error also grows in proportion to the size of the solution. But there are
many differential equations, often described as being “chaotic”, where the size of
the solution remains bounded, but the differential equation is persistently unstable.
Examples of such equations include the equations E. Lorenz [166] developed to
understand weather. The realization came that there are persistently unstable systems
where a small perturbation at one place and time can be amplified over time to cause
large changes in the solution later. It has been summarized as the butterfly effect:
perhaps the beat of a butterfly’s wings in South America could be amplified over
months to result in a hurricane in Florida a year later. No-one, of course, has been
able to demonstrate this: there are so many small perturbations, not to
mention billions of butterflies, that trying to track their influence on a large and
unstable system like the weather is essentially impossible. However, the fact of
exponential growth of perturbations cannot be avoided. This means that numerical
errors can also grow exponentially after they have occurred.

Fig. 6.1.2 Solutions of Lorenz’ equations


To illustrate the effects of persistent instability, Figure 6.1.2 shows a numerical 
trajectory for Lorenz’ equations, and the difference between solutions with two initial 
values that differ by about 3 x 10~°. While the amplification of errors is somewhat 
erratic on small time scales, the overall trend of exponential growth of the difference 
between two solutions until the differences are of order one is clear. 

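Lorenz’ equations are

$$ \frac{dx}{dt} = \sigma\,(y - x), \tag{6.1.13} $$

$$ \frac{dy}{dt} = -x z + \rho\, x - y, \tag{6.1.14} $$

$$ \frac{dz}{dt} = x y - \beta z, \tag{6.1.15} $$

with standard values $\sigma = 10$, $\rho = 28$, and $\beta = 8/3$.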
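The growth of a small perturbation can be reproduced with a short experiment. The sketch below is illustrative only: it integrates (6.1.13) to (6.1.15) with Euler's method (6.1.8), and the starting point $(1, 1, 1)$, the perturbation size $10^{-9}$, the step size $10^{-3}$ and the final time 30 are all assumptions chosen here, not the settings used for Figure 6.1.2.

```python
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of Lorenz' equations (6.1.13)-(6.1.15)."""
    x, y, z = state
    return (sigma * (y - x), -x * z + rho * x - y, x * y - beta * z)


def euler_step(state, h):
    """One step of Euler's method (6.1.8) for the Lorenz system."""
    return tuple(s + h * ds for s, ds in zip(state, lorenz(state)))


def separation_history(delta=1e-9, h=1e-3, t_end=30.0):
    """Distance between two trajectories whose initial x-values differ by delta."""
    a = (1.0, 1.0, 1.0)
    b = (1.0 + delta, 1.0, 1.0)
    history = []
    for _ in range(int(round(t_end / h))):
        a, b = euler_step(a, h), euler_step(b, h)
        history.append(max(abs(p - q) for p, q in zip(a, b)))
    return history


history = separation_history()
for t_index in (0, 4999, 9999, 19999, 29999):
    print(f"t = {(t_index + 1) * 1e-3:5.1f}  separation = {history[t_index]:.3e}")
```

Both trajectories are computed with the same method and step size, so the separation measures amplification of the initial perturbation rather than the discretization error itself.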


6.1.3 Improving on Euler: Trapezoidal, Midpoint, and Heun 


One way of thinking about improving on Euler’s method is to consider the equation

$$ x(t_{k+1}) = x(t_k) + \int_{t_k}^{t_{k+1}} f(t, x(t))\, dt, $$

and try to find a more accurate way to approximate the integral. Euler’s method
essentially uses the rectangle rule to estimate the integral. By using a second-order
integration method, we should obtain greater accuracy. There are two main ways to
do this:

$$ \int_{t_k}^{t_{k+1}} f(t)\, dt \approx \frac{t_{k+1} - t_k}{2} \left[ f(t_k) + f(t_{k+1}) \right] \qquad \text{trapezoidal rule} $$

$$ \int_{t_k}^{t_{k+1}} f(t)\, dt \approx (t_{k+1} - t_k)\, f\!\left( \frac{t_k + t_{k+1}}{2} \right) \qquad \text{mid-point rule.} $$

These give two implicit methods, the implicit trapezoidal rule and the implicit
mid-point rule:

$$ x_{k+1} = x_k + \frac{h}{2} \left[ f(t_k, x_k) + f(t_{k+1}, x_{k+1}) \right] \qquad \text{trapezoidal rule} $$

$$ x_{k+1} = x_k + h\, f\!\left( \tfrac{1}{2}(t_k + t_{k+1}),\; \tfrac{1}{2}(x_k + x_{k+1}) \right) \qquad \text{mid-point rule} $$

The difficulty with these methods is that they are implicit; that is, the solution value 
at the end of the time step, $x_{k+1}$, is in the right-hand side as an argument to $f$ as 
well as in the left-hand side. This means that there is a system of equations to solve 
rather than just a function to evaluate. Methods that just require function evaluations 
$f(t, x)$ and forming linear combinations are called explicit methods. Euler’s method 
is an explicit method. 

Implicit methods are more complex to implement than explicit methods. In general, 
implicit methods require solvers, which may be simple fixed-point iterations, 
versions of Newton’s method, or some hybrid of these. Other issues that arise include 
choosing error tolerances for the solvers, and how to incorporate knowledge of the 
problem, for example through preconditioners and specialized iterative methods. 

Because of these difficulties, it is often easier to develop an explicit method that 
captures the desired order without resorting to implicit equations to be solved. Heun 
[123] found a way to keep the order by replacing $f(t_{k+1}, x_{k+1})$ with $f(t_{k+1}, z_{k+1})$ 
where $z_{k+1}$ is a first-order approximation to $x_{k+1}$. This first-order approximation can 
come from Euler’s method. This gives Heun’s method: 

$$z_{k+1} = x_k + h\, f(t_k, x_k), \tag{6.1.16}$$

$$x_{k+1} = x_k + \frac{h}{2}\,\bigl[f(t_k, x_k) + f(t_{k+1}, z_{k+1})\bigr]. \tag{6.1.17}$$
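A minimal sketch comparing Euler's method with Heun's method (6.1.16)-(6.1.17) may help here. It uses the test problem $dy/dt = 1 + y^2$, $y(0) = 0$, whose exact solution is $y(t) = \tan t$. The end time $t = 1$, the step counts, and the function names are assumptions made for illustration.

```python
import math


def f(t, y):
    """Right-hand side of dy/dt = 1 + y^2; the exact solution is y(t) = tan(t)."""
    return 1.0 + y * y


def euler_step(f, t, x, h):
    return x + h * f(t, x)


def heun_step(f, t, x, h):
    z = x + h * f(t, x)                              # Euler predictor, (6.1.16)
    return x + 0.5 * h * (f(t, x) + f(t + h, z))     # trapezoidal corrector, (6.1.17)


def max_error(step, n, t_end=1.0):
    """Maximum error over the grid when taking n equal steps on [0, t_end]."""
    h = t_end / n
    t, x, err = 0.0, 0.0, 0.0
    for k in range(n):
        x = step(f, t, x, h)
        t = (k + 1) * h
        err = max(err, abs(x - math.tan(t)))
    return err


for n in (50, 100, 200):
    print(n, max_error(euler_step, n), max_error(heun_step, n))
```

Halving $h$ should roughly halve the Euler error and reduce the Heun error by about a factor of four, consistent with first- and second-order accuracy.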


There is one more second-order method that we will mention here: the leap-frog 
method. This uses the previous two values $x_{k-1}$ and $x_k$ to compute $x_{k+1}$ by a version 
of the mid-point rule: 

$$x_{k+1} = x_{k-1} + 2h\, f(t_k, x_k). \tag{6.1.18}$$

This is also an explicit method. This method is rarely used because of stability issues, 
which will be discussed later. 

Figure 6.1.1(b) shows the maximum error for Heun’s method for $dy/dt = 1 + y^2$, 
$y(0) = 0$. 

Three of these methods, the implicit mid-point rule, the implicit trapezoidal rule, 
and Heun’s method, are all Runge-Kutta methods. The leap-frog method, on the other 
hand, is a multistep method. We will discuss the Runge-Kutta family of methods 
more systematically in Section 6.1.4 and multistep methods in Section 6.1.5. 


6.1.4 Runge-Kutta Methods 


Carl Runge [222] and Martin Wilhelm Kutta [150] developed a fourth order explicit 
method for solving differential equations: 

$$\begin{aligned}
v_{k,1} &= f(t_k, x_k),\\
v_{k,2} &= f(t_k + \tfrac{1}{2}h,\ x_k + \tfrac{1}{2}h\,v_{k,1}),\\
v_{k,3} &= f(t_k + \tfrac{1}{2}h,\ x_k + \tfrac{1}{2}h\,v_{k,2}),\\
v_{k,4} &= f(t_k + h,\ x_k + h\,v_{k,3}),\\
x_{k+1} &= x_k + \frac{h}{6}\,\bigl[v_{k,1} + 2v_{k,2} + 2v_{k,3} + v_{k,4}\bigr].
\end{aligned} \tag{6.1.19}$$

Along with Euler’s method, the implicit mid-point and trapezoidal rules, and 
Heun’s method, this is a single-step method as computing $x_{k+1}$ only requires $x_k$, 
unlike multistep methods which use more prior values $x_k, x_{k-1}, \ldots, x_{k-p}$. 

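Butcher [44, 46] developed a framework for the analysis of these methods. First, 
we need a consistent way of representing these methods. Butcher used Butcher 
tableaus: the Runge-Kutta method 

$$v_{k,j} = f\Bigl(t_k + c_j h,\ x_k + h\sum_{i=1}^{s} a_{ji}\, v_{k,i}\Bigr), \qquad j = 1, 2, \ldots, s, \tag{6.1.20}$$

$$x_{k+1} = x_k + h\sum_{j=1}^{s} b_j\, v_{k,j} \tag{6.1.21}$$

is represented by the tableau 

$$\begin{array}{c|cccc}
c_1 & a_{11} & a_{12} & \cdots & a_{1s}\\
c_2 & a_{21} & a_{22} & \cdots & a_{2s}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
c_s & a_{s1} & a_{s2} & \cdots & a_{ss}\\
\hline
 & b_1 & b_2 & \cdots & b_s
\end{array}
\qquad\text{or}\qquad
\begin{array}{c|c}
c & A\\
\hline
 & b^T
\end{array}. \tag{6.1.22}$$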

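The integer $s$ is the number of stages of the Runge-Kutta method. 

$$\text{(A) Euler's method: }
\begin{array}{c|c} 0 & 0\\ \hline & 1 \end{array}
\qquad
\text{(B) Implicit mid-point rule: }
\begin{array}{c|c} \tfrac12 & \tfrac12\\ \hline & 1 \end{array}$$

$$\text{(C) Implicit trapezoidal method: }
\begin{array}{c|cc} 0 & 0 & 0\\ 1 & \tfrac12 & \tfrac12\\ \hline & \tfrac12 & \tfrac12 \end{array}
\qquad
\text{(D) Heun's method: }
\begin{array}{c|cc} 0 & 0 & 0\\ 1 & 1 & 0\\ \hline & \tfrac12 & \tfrac12 \end{array}$$

$$\text{(E) Standard 4th order Runge-Kutta method: }
\begin{array}{c|cccc}
0 & 0 & 0 & 0 & 0\\
\tfrac12 & \tfrac12 & 0 & 0 & 0\\
\tfrac12 & 0 & \tfrac12 & 0 & 0\\
1 & 0 & 0 & 1 & 0\\
\hline
 & \tfrac16 & \tfrac13 & \tfrac13 & \tfrac16
\end{array}$$

Fig. 6.1.3. Butcher tableaus of well-known Runge-Kutta methods 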


The single-step methods we have already seen can be put in this form. Figure 6.1.3 
shows them. 

One property that is very easily determined from the Butcher tableau is whether the 
method is implicit: the Butcher tableau of an explicit method has $A$ strictly lower 
triangular, possibly after a permutation of the stages. 
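As a sketch of how a Butcher tableau drives a computation, the block below tests whether $A$ is strictly lower triangular and, if so, performs one step of (6.1.20)-(6.1.21) for a scalar equation. The function names and the restriction to scalar problems are assumptions for illustration, and no permutation of the stages is attempted.

```python
def is_explicit(A):
    """True when A is strictly lower triangular (stage permutations are not tried)."""
    s = len(A)
    return all(A[j][i] == 0 for j in range(s) for i in range(j, s))


def explicit_rk_step(A, b, c, f, t, x, h):
    """One step of (6.1.20)-(6.1.21) for a scalar ODE with strictly lower triangular A."""
    if not is_explicit(A):
        raise ValueError("tableau is not explicit in the given stage order")
    v = []
    for j in range(len(b)):
        # Stage j only needs the stages i < j that were already computed.
        xj = x + h * sum(A[j][i] * v[i] for i in range(j))
        v.append(f(t + c[j] * h, xj))
    return x + h * sum(bj * vj for bj, vj in zip(b, v))


# Tableaus (A, b, c) from Figure 6.1.3.
HEUN = ([[0, 0], [1, 0]], [0.5, 0.5], [0, 1])
RK4 = ([[0, 0, 0, 0], [0.5, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 1, 0]],
       [1 / 6, 1 / 3, 1 / 3, 1 / 6], [0, 0.5, 0.5, 1])
TRAPEZOIDAL = ([[0, 0], [0.5, 0.5]], [0.5, 0.5], [0, 1])


def f(t, x):
    return x          # dx/dt = x, so one exact step multiplies x by exp(h)


x_heun = explicit_rk_step(*HEUN, f, 0.0, 1.0, 0.1)
x_rk4 = explicit_rk_step(*RK4, f, 0.0, 1.0, 0.1)
print(x_heun, x_rk4, is_explicit(TRAPEZOIDAL[0]))
```

For $dx/dt = x$ the two results are the degree-2 and degree-4 Taylor polynomials of $e^h$, while the implicit trapezoidal tableau is rejected because $a_{22} \neq 0$.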


6.1.4.1 Solvability of the Runge-Kutta Equations 


For the implicit Runge-Kutta methods, we would like to see that there are solutions 
of the Runge-Kutta equations (6.1.20), at least for sufficiently small $h$ provided $f$ 
is Lipschitz. Consider using the simple fixed-point iteration for (6.1.20): 

$$v_{k,j}^{(p+1)} = f\Bigl(t_k + c_j h,\ x_k + h\sum_{i=1}^{s} a_{ji}\, v_{k,i}^{(p)}\Bigr), \qquad j = 1, 2, \ldots, s,$$

where $p$ is the iteration index. To see if this is a contraction mapping (see 
Theorem 3.3), we consider 

$$u_{k,j}^{(p+1)} = f\Bigl(t_k + c_j h,\ x_k + h\sum_{i=1}^{s} a_{ji}\, u_{k,i}^{(p)}\Bigr), \qquad j = 1, 2, \ldots, s;$$

we wish to bound $\bigl\|v_{k,j}^{(p+1)} - u_{k,j}^{(p+1)}\bigr\|$ in terms of $\bigl\|v_{k,i}^{(p)} - u_{k,i}^{(p)}\bigr\|$: 

$$\bigl\|v_{k,j}^{(p+1)} - u_{k,j}^{(p+1)}\bigr\|
\le L h \sum_{i=1}^{s} |a_{ji}|\, \bigl\|v_{k,i}^{(p)} - u_{k,i}^{(p)}\bigr\|
\le L h\, \|A\|_\infty \max_i \bigl\|v_{k,i}^{(p)} - u_{k,i}^{(p)}\bigr\|.$$

Provided $L h\, \|A\|_\infty < 1$ the mapping is a contraction mapping, and so there is a 
unique solution to (6.1.20) for sufficiently small $h$. Substituting this solution into 
(6.1.21) gives $x_{k+1}$ in terms of $v_{k,j}$ for $j = 1, 2, \ldots, s$, which in turn are implicitly 
determined by $x_k$. This gives $x_{k+1} = \Phi_h(x_k)$. The function $\Phi_h$ clearly depends on 
the Runge-Kutta method and $f$ as well as the step size $h$. 

We will want to solve these equations even if $L h\, \|A\|_\infty > 1$. There will be more 
on this case in Section 6.1.6 on stiff differential equations. 
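The fixed-point iteration above can be sketched for the one-stage implicit mid-point rule, for which $A = [1/2]$ and $\|A\|_\infty = 1/2$. The scalar test problem $dx/dt = \lambda x$ with $\lambda = -5$ (so $L = 5$), the tolerance, the starting guess, and the function name are assumptions made for illustration.

```python
def implicit_midpoint_step(f, t, x, h, tol=1e-12, max_iter=100):
    """One implicit mid-point step for a scalar ODE.

    The stage equation v = f(t + h/2, x + (h/2) v) is solved by simple
    fixed-point iteration; returns (x_next, iterations_used).
    """
    v = f(t, x)                      # starting guess (an arbitrary choice)
    for iteration in range(1, max_iter + 1):
        v_new = f(t + 0.5 * h, x + 0.5 * h * v)
        if abs(v_new - v) <= tol:
            return x + h * v_new, iteration
        v = v_new
    raise RuntimeError("fixed-point iteration did not converge")


lam = -5.0


def f(t, x):
    return lam * x                   # Lipschitz constant L = 5


# L h ||A|| = 5 * 0.1 * 0.5 = 0.25 < 1: the iteration is a contraction.
x1, iters = implicit_midpoint_step(f, 0.0, 1.0, 0.1)
print(x1, iters)

# L h ||A|| = 5 * 1.0 * 0.5 = 2.5 > 1: this simple iteration diverges,
# even though the stage equation itself still has a solution.
try:
    implicit_midpoint_step(f, 0.0, 1.0, 1.0)
    diverged = False
except RuntimeError:
    diverged = True
print(diverged)
```

With $h = 0.1$ the stage equation has the solution $v = -4x$, so the step returns $0.6\,x$. The second call illustrates why a different solver, such as Newton's method, is needed once $L h \|A\|_\infty$ exceeds one.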


6.1.4.2 Error Analysis 


The local truncation error (LTE) of a Runge-Kutta method is the difference between 
the result of applying one step of the method to $x_k$ and the true solution of the 
differential equation with $x(t_k) = x_k$ at time $t_{k+1}$: $\tau_k(x_k) = x(t_{k+1}; x_k, t_k) - \Phi_h(x_k)$ 
where 

$$\frac{\partial x}{\partial t}(t; x_k, t_k) = f\bigl(t, x(t; x_k, t_k)\bigr), \qquad x(t_k; x_k, t_k) = x_k.$$


We can estimate the LTE using the method developed by J. Butcher [45, 46] using 
so-called Butcher trees, as we will see below. First we will look at the amplification 
of errors in the Runge-Kutta method. Essentially we want to estimate the Lipschitz 
constant of $\Phi_h$. Consider the Runge-Kutta equations (6.1.20) for two starting points 
$x_k$ and $y_k$: 

$$v_{k,j} = f\Bigl(t_k + c_j h,\ x_k + h\sum_{i=1}^{s} a_{ji}\, v_{k,i}\Bigr), \qquad j = 1, 2, \ldots, s,$$

$$w_{k,j} = f\Bigl(t_k + c_j h,\ y_k + h\sum_{i=1}^{s} a_{ji}\, w_{k,i}\Bigr), \qquad j = 1, 2, \ldots, s, \quad\text{so}$$

$$\|v_{k,j} - w_{k,j}\| \le L\Bigl[\|x_k - y_k\| + h\sum_{i=1}^{s} |a_{ji}|\, \|v_{k,i} - w_{k,i}\|\Bigr].$$

Then 

$$\max_j \|v_{k,j} - w_{k,j}\| \le \frac{L}{1 - h L\, \|A\|_\infty}\, \|x_k - y_k\|.$$

The results of one step of the Runge-Kutta method 

$$\Phi_h(x_k) = x_k + h\sum_{j=1}^{s} b_j\, v_{k,j}, \qquad
\Phi_h(y_k) = y_k + h\sum_{j=1}^{s} b_j\, w_{k,j},$$

can be compared: 

$$\begin{aligned}
\|\Phi_h(x_k) - \Phi_h(y_k)\| &\le \|x_k - y_k\| + h\sum_{j=1}^{s} |b_j|\, \|v_{k,j} - w_{k,j}\|\\
&\le \Bigl(1 + \frac{h L\, \|b\|_1}{1 - h L\, \|A\|_\infty}\Bigr)\, \|x_k - y_k\|.
\end{aligned}$$

Provided $h L\, \|A\|_\infty < 1/2$ we have the bound 

$$\|\Phi_h(x_k) - \Phi_h(y_k)\| \le \bigl(1 + 2L\, \|b\|_1\, h\bigr)\, \|x_k - y_k\|. \tag{6.1.23}$$


6.1.4.3 Butcher Trees and the Local Truncation Error 


The keys to estimating the local truncation error (LTE) are Taylor series: the Taylor 
series of $f(t + ch, x + h v)$, and the Taylor series of the solution $x(t + h)$. To avoid 
having to deal with derivatives of $f(t + ch, x + h v)$ with respect to both $t$ and $x$, 
we replace the differential equation with an autonomous equation: 

$$\frac{d}{dt}\begin{bmatrix} x\\ t \end{bmatrix} = \begin{bmatrix} f(t, x)\\ 1 \end{bmatrix}, \qquad
\begin{bmatrix} x(t_0)\\ t_0 \end{bmatrix} = \begin{bmatrix} x_0\\ t_0 \end{bmatrix} \tag{6.1.24}$$

gives an initial value problem $d\tilde{x}/dt = \tilde{f}(\tilde{x})$, $\tilde{x}(t_0) = \tilde{x}_0$ where $\tilde{x}(t) = [x(t)^T, t]^T$. 
Note that matching the Runge-Kutta method for (6.1.24) to the Runge-Kutta method 
for the original differential equation gives 

$$c_j = \sum_{i=1}^{s} a_{ji} \qquad \text{for } j = 1, \ldots, s. \tag{6.1.25}$$

Another requirement on the Butcher tableau (6.1.22) to obtain correct solutions for 
$f(t, x) = $ constant is that 

$$\sum_{j=1}^{s} b_j = 1. \tag{6.1.26}$$

We drop the tilde in what follows, so we assume the initial value problem is in 
the form $dx/dt = f(x)$, $x(t_0) = x_0$. Using the notation (1.6.2) from Section 1.6.2 
we can efficiently represent these Taylor series: 

$$f(x + h v) = f(x) + \sum_{k=1}^{m} \frac{h^k}{k!}\, D^k f(x)[v, v, \ldots, v] + O(h^{m+1}).$$
k=1 


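For the Taylor series expansion of $x(t + h)$ we first note that $dx/dt = f(x)$; then 

$$\frac{d^2x}{dt^2} = \frac{d}{dt}\Bigl(\frac{dx}{dt}\Bigr) = \frac{d}{dt}\bigl(f(x)\bigr) = D^1 f(x)\Bigl[\frac{dx}{dt}\Bigr] = D^1 f(x)[f(x)],$$

$$\frac{d^3x}{dt^3} = \frac{d}{dt}\Bigl(\frac{d^2x}{dt^2}\Bigr) = \frac{d}{dt}\bigl(D^1 f(x)[f(x)]\bigr)
= D^2 f(x)[f(x), f(x)] + D^1 f(x)\bigl[D^1 f(x)[f(x)]\bigr], \quad\text{etc.}$$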


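Butcher’s insight was to represent these expressions using trees: let $\mathrm{expr}(\tau)$ be the 
expression represented by tree $\tau$. We start with $\mathrm{expr}(\bullet) = f(x)$. The tree formed by 
joining a new root to the roots of $\tau_1, \tau_2, \ldots, \tau_k$, where each $\tau_i$ is a tree, is 
recursively denoted $[\tau_1, \tau_2, \ldots, \tau_k]$. We recursively define 

$$\mathrm{expr}([\tau_1, \tau_2, \ldots, \tau_k]) = D^k f(x)[\mathrm{expr}(\tau_1), \mathrm{expr}(\tau_2), \ldots, \mathrm{expr}(\tau_k)]. \tag{6.1.27}$$

Since $D^k f(x)[w_1, w_2, \ldots, w_k] = D^k f(x)[v_1, v_2, \ldots, v_k]$ whenever $w_1, w_2, \ldots, w_k$ 
is a permutation of $v_1, v_2, \ldots, v_k$, it follows that if $\sigma_1, \sigma_2, \ldots, \sigma_k$ is a permutation 
of $\tau_1, \tau_2, \ldots, \tau_k$ then we can identify $[\sigma_1, \sigma_2, \ldots, \sigma_k] = [\tau_1, \tau_2, \ldots, \tau_k]$. 
Identifying the tree with the expression, we can write 

$$\frac{d}{dt}[\tau_1, \tau_2, \ldots, \tau_k] = [\bullet, \tau_1, \tau_2, \ldots, \tau_k] + \sum_{j=1}^{k}\Bigl[\tau_1, \ldots, \frac{d\tau_j}{dt}, \ldots, \tau_k\Bigr].$$

And so, 

$$\begin{aligned}
\frac{dx}{dt} &= f(x) = \bullet,\\
\frac{d^2x}{dt^2} &= \frac{d}{dt}\bigl(f(x)\bigr) = D^1 f(x)[f(x)] = [\bullet],\\
\frac{d^3x}{dt^3} &= \frac{d}{dt}\bigl([\bullet]\bigr) = [\bullet, \bullet] + [[\bullet]],\\
\frac{d^4x}{dt^4} &= \frac{d}{dt}\bigl([\bullet, \bullet] + [[\bullet]]\bigr)\\
&= [\bullet, \bullet, \bullet] + [[\bullet], \bullet] + [\bullet, [\bullet]] + [\bullet, [\bullet]] + \bigl[\,[\bullet, \bullet] + [[\bullet]]\,\bigr]\\
&= [\bullet, \bullet, \bullet] + 3\,[\bullet, [\bullet]] + [[\bullet, \bullet]] + [[[\bullet]]], \quad\text{etc.}
\end{aligned}$$

Fig. 6.1.4 Butcher trees 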


Readers with a stronger visual sense may find the diagrams in Figure 6.1.4 easier to 
understand. 

We want the results of the Runge-Kutta methods to match the Taylor series 
expansions of the solution. So we need to expand $f\bigl(x + h\sum_{i=1}^{s} a_{ji} v_i\bigr)$ in terms of the 
Butcher trees. Note that since the expressions $D^k f(x)[w_1, \ldots, w_k]$ are linear in each 
$w_j$, the summations $\sum_i a_{ji} v_i$ can be expanded in terms of trees. Also, modifying the 
bracket notation for the trees to allow items that are not themselves Butcher trees, 
we write 

$$\begin{aligned}
f(x + h w) &= f(x) + D^1 f(x)[h w] + \tfrac{1}{2} D^2 f(x)[h w, h w] + \tfrac{1}{3!} D^3 f(x)[h w, h w, h w] + \cdots\\
&= \bullet + h\,[w] + \tfrac{1}{2}h^2\,[w, w] + \tfrac{1}{3!}h^3\,[w, w, w] + \cdots
\end{aligned}$$

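The Runge-Kutta equations (6.1.20) then become 

$$\begin{aligned}
v_j &= \bullet + h\Bigl[\sum_i a_{ji} v_i\Bigr] + \tfrac{1}{2}h^2\Bigl[\sum_i a_{ji} v_i,\ \sum_k a_{jk} v_k\Bigr]
+ \tfrac{1}{3!}h^3\Bigl[\sum_i a_{ji} v_i,\ \sum_k a_{jk} v_k,\ \sum_\ell a_{j\ell} v_\ell\Bigr] + \cdots\\
&= \bullet + h\sum_i a_{ji}\,[v_i] + \tfrac{1}{2}h^2\sum_{i,k} a_{ji} a_{jk}\,[v_i, v_k]
+ \tfrac{1}{3!}h^3\sum_{i,k,\ell} a_{ji} a_{jk} a_{j\ell}\,[v_i, v_k, v_\ell] + \cdots
\end{aligned}$$

Substituting recursively into this expression for $v_j$ enables us to expand this 
expression to any order of $h$. To get the expansion to third order, we expand $v_i$ to second 
order in the term with factor $h$, expand $v_i$ and $v_k$ to first order in the term with factor 
$h^2$, and to zeroth order in the term with factor $h^3$. This gives 

$$\begin{aligned}
v_j &= \bullet + h\sum_i a_{ji}\Bigl[\bullet + h\sum_k a_{ik}\,[v_k] + \tfrac{1}{2}h^2\sum_{k,\ell} a_{ik} a_{i\ell}\,[v_k, v_\ell]\Bigr]\\
&\quad + \tfrac{1}{2}h^2\sum_{i,k} a_{ji} a_{jk}\Bigl[\bullet + h\sum_p a_{ip}\,[v_p],\ \bullet + h\sum_q a_{kq}\,[v_q]\Bigr]
+ \tfrac{1}{6}h^3\sum_{i,k,\ell} a_{ji} a_{jk} a_{j\ell}\,[\bullet, \bullet, \bullet] + O(h^4)\\
&= \bullet + h\sum_i a_{ji}\,[\bullet] + h^2\sum_{i,k} a_{ji} a_{ik}\,[[v_k]] + \tfrac{1}{2}h^3\sum_{i,k,\ell} a_{ji} a_{ik} a_{i\ell}\,[[v_k, v_\ell]]\\
&\quad + \tfrac{1}{2}h^2\sum_{i,k} a_{ji} a_{jk}\,[\bullet, \bullet] + \tfrac{1}{2}h^3\sum_{i,k,p} a_{ji} a_{jk} a_{ip}\,[[v_p], \bullet]\\
&\quad + \tfrac{1}{2}h^3\sum_{i,k,q} a_{ji} a_{jk} a_{kq}\,[\bullet, [v_q]] + \tfrac{1}{6}h^3\sum_{i,k,\ell} a_{ji} a_{jk} a_{j\ell}\,[\bullet, \bullet, \bullet] + O(h^4)\\
&= \bullet + h\sum_i a_{ji}\,[\bullet] + h^2\sum_{i,k} a_{ji} a_{ik}\,[[\bullet]] + h^3\sum_{i,k,\ell} a_{ji} a_{ik} a_{k\ell}\,[[[\bullet]]]\\
&\quad + \tfrac{1}{2}h^3\sum_{i,k,\ell} a_{ji} a_{ik} a_{i\ell}\,[[\bullet, \bullet]] + \tfrac{1}{2}h^2\sum_{i,k} a_{ji} a_{jk}\,[\bullet, \bullet]\\
&\quad + \tfrac{1}{2}h^3\sum_{i,k,p} a_{ji} a_{jk} a_{ip}\,[[\bullet], \bullet] + \tfrac{1}{2}h^3\sum_{i,k,q} a_{ji} a_{jk} a_{kq}\,[\bullet, [\bullet]]
+ \tfrac{1}{6}h^3\sum_{i,k,\ell} a_{ji} a_{jk} a_{j\ell}\,[\bullet, \bullet, \bullet] + O(h^4)\\
&= \bullet + h\sum_i a_{ji}\,[\bullet] + h^2\Bigl\{\sum_{i,k} a_{ji} a_{ik}\,[[\bullet]] + \tfrac{1}{2}\sum_{i,k} a_{ji} a_{jk}\,[\bullet, \bullet]\Bigr\}\\
&\quad + h^3\Bigl\{\sum_{i,k,\ell} a_{ji} a_{ik} a_{k\ell}\,[[[\bullet]]] + \tfrac{1}{2}\sum_{i,k,\ell} a_{ji} a_{ik} a_{i\ell}\,[[\bullet, \bullet]]\\
&\qquad\quad + \sum_{i,k,p} a_{ji} a_{jk} a_{ip}\,[[\bullet], \bullet] + \tfrac{1}{6}\sum_{i,k,\ell} a_{ji} a_{jk} a_{j\ell}\,[\bullet, \bullet, \bullet]\Bigr\} + O(h^4).
\end{aligned}$$

Now we note that 

$$\begin{aligned}
x_{m+1} &= x_m + h\sum_{j=1}^{s} b_j v_j\\
&= x_m + h\sum_{j=1}^{s} b_j\,\bullet + h^2\sum_{i,j} b_j a_{ji}\,[\bullet]
+ h^3\Bigl\{\sum_{i,j,k} b_j a_{ji} a_{ik}\,[[\bullet]] + \tfrac{1}{2}\sum_{i,j,k} b_j a_{ji} a_{jk}\,[\bullet, \bullet]\Bigr\}\\
&\quad + h^4\Bigl\{\sum_{i,j,k,\ell} b_j a_{ji} a_{ik} a_{k\ell}\,[[[\bullet]]] + \tfrac{1}{2}\sum_{i,j,k,\ell} b_j a_{ji} a_{ik} a_{i\ell}\,[[\bullet, \bullet]]\\
&\qquad\quad + \sum_{i,j,k,p} b_j a_{ji} a_{jk} a_{ip}\,[[\bullet], \bullet] + \tfrac{1}{6}\sum_{i,j,k,\ell} b_j a_{ji} a_{jk} a_{j\ell}\,[\bullet, \bullet, \bullet]\Bigr\} + O(h^5).
\end{aligned} \tag{6.1.28}$$

There are some rules we can use to simplify these results. The main rule we use is 
(6.1.25): $c_j = \sum_i a_{ji}$. This simplifies the previous (long) Taylor expansion above to 

$$\begin{aligned}
x_{m+1} &= x_m + h\sum_{j=1}^{s} b_j\,\bullet + h^2\sum_j b_j c_j\,[\bullet]
+ h^3\Bigl\{\sum_{i,j} b_j a_{ji} c_i\,[[\bullet]] + \tfrac{1}{2}\sum_j b_j c_j^2\,[\bullet, \bullet]\Bigr\}\\
&\quad + h^4\Bigl\{\sum_{i,j,k} b_j a_{ji} a_{ik} c_k\,[[[\bullet]]] + \tfrac{1}{2}\sum_{i,j} b_j a_{ji} c_i^2\,[[\bullet, \bullet]]\\
&\qquad\quad + \sum_{i,j} b_j c_j a_{ji} c_i\,[[\bullet], \bullet] + \tfrac{1}{6}\sum_j b_j c_j^3\,[\bullet, \bullet, \bullet]\Bigr\} + O(h^5).
\end{aligned}$$

Comparing with the Taylor series of the exact solution, we can identify the conditions 
for obtaining a given order for the local truncation error up to $O(h^4)$: 

$$\begin{aligned}
x(t_{m+1}) &= x(t_m) + h\,\bullet + \tfrac{1}{2}h^2\,[\bullet] + \tfrac{1}{3!}h^3\bigl\{[\bullet, \bullet] + [[\bullet]]\bigr\}\\
&\quad + \tfrac{1}{4!}h^4\bigl\{[\bullet, \bullet, \bullet] + 3\,[[\bullet], \bullet] + [[\bullet, \bullet]] + [[[\bullet]]]\bigr\} + O(h^5).
\end{aligned}$$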


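This gives the conditions in Table 6.1.1. If we use the Hadamard (or componentwise) 
product, $u \circ v = [u_1 v_1, u_2 v_2, \ldots, u_s v_s]^T$, these conditions can be written in 
a simplified way, as shown in Table 6.1.1. 

Table 6.1.1 Order conditions for Runge-Kutta methods 

$$\begin{array}{llll}
\text{Condition (sum form)} & \text{Condition (matrix form)} & \text{Tree} & \text{Order}\\
\hline
\sum_j b_j = 1 & e^T b = 1 & \bullet & \text{1st}\\
\sum_j b_j c_j = \tfrac{1}{2} & b^T c = \tfrac{1}{2} & [\bullet] & \text{2nd}\\
\sum_{i,j} b_j a_{ji} c_i = \tfrac{1}{6} & b^T A c = \tfrac{1}{6} & [[\bullet]] & \text{3rd}\\
\sum_j b_j c_j^2 = \tfrac{1}{3} & b^T (c \circ c) = \tfrac{1}{3} & [\bullet, \bullet] & \\
\sum_{i,j} b_j a_{ji} c_i^2 = \tfrac{1}{12} & b^T A (c \circ c) = \tfrac{1}{12} & [[\bullet, \bullet]] & \text{4th}\\
\sum_{i,j} b_j c_j a_{ji} c_i = \tfrac{1}{8} & (b \circ c)^T A c = \tfrac{1}{8} & [[\bullet], \bullet] & \\
\sum_{i,j,k} b_j a_{ji} a_{ik} c_k = \tfrac{1}{24} & b^T A^2 c = \tfrac{1}{24} & [[[\bullet]]] & \\
\sum_j b_j c_j^3 = \tfrac{1}{4} & b^T (c \circ c \circ c) = \tfrac{1}{4} & [\bullet, \bullet, \bullet] &
\end{array}$$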
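The eight conditions of Table 6.1.1 can be checked mechanically for a given tableau. The sketch below does this in exact rational arithmetic for the standard fourth-order Runge-Kutta method of (6.1.19); the helper names (`matvec`, `dot`, `had`) are illustrative assumptions.

```python
from fractions import Fraction as F

# Butcher tableau of the standard fourth-order Runge-Kutta method.
A = [[0, 0, 0, 0],
     [F(1, 2), 0, 0, 0],
     [0, F(1, 2), 0, 0],
     [0, 0, 1, 0]]
b = [F(1, 6), F(1, 3), F(1, 3), F(1, 6)]
s = len(b)
c = [sum(A[j]) for j in range(s)]              # row sums, as in (6.1.25)
e = [1] * s


def matvec(M, v):
    return [sum(M[j][i] * v[i] for i in range(s)) for j in range(s)]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def had(u, v):
    """Hadamard (componentwise) product."""
    return [ui * vi for ui, vi in zip(u, v)]


Ac = matvec(A, c)
conditions = {
    "•":         (dot(b, e), F(1)),
    "[•]":       (dot(b, c), F(1, 2)),
    "[[•]]":     (dot(b, Ac), F(1, 6)),
    "[•,•]":     (dot(b, had(c, c)), F(1, 3)),
    "[[•,•]]":   (dot(b, matvec(A, had(c, c))), F(1, 12)),
    "[[•],•]":   (dot(had(b, c), Ac), F(1, 8)),
    "[[[•]]]":   (dot(b, matvec(A, Ac)), F(1, 24)),
    "[•,•,•]":   (dot(b, had(c, had(c, c))), F(1, 4)),
}
for tree, (lhs, rhs) in conditions.items():
    print(tree, lhs, rhs, lhs == rhs)
```

Using `Fraction` avoids any doubt about rounding: each condition either holds exactly or it does not.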


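John Butcher analyzed the combinatorics of the Butcher trees and their application 
in [46, Chap. 3]. We start with $|\tau|$ which is the number of nodes of the Butcher tree 
$\tau$. Note that if $\tau = [\tau_1, \tau_2, \ldots, \tau_k]$ then $|\tau| = 1 + \sum_{j=1}^{k} |\tau_j|$, while $|\bullet| = 1$. 

The Taylor series with remainder of order $p + 1$ of the exact solution is 

$$x(t + h) = x(t) + \sum_{\tau:\, |\tau| \le p} \frac{h^{|\tau|}}{\sigma(\tau)\, \gamma(\tau)}\, \mathrm{expr}(\tau)\big|_{x = x(t)} + O(h^{p+1}) \tag{6.1.29}$$

where $\sigma(\tau)$ and $\gamma(\tau)$ are combinatorial quantities that can be computed recursively 
by 

$$\sigma\bigl([\underbrace{\tau_1, \ldots, \tau_1}_{m_1}, \underbrace{\tau_2, \ldots, \tau_2}_{m_2}, \ldots, \underbrace{\tau_r, \ldots, \tau_r}_{m_r}]\bigr) = \prod_{i=1}^{r} m_i!\ \sigma(\tau_i)^{m_i},$$

$$\gamma\bigl([\underbrace{\tau_1, \ldots, \tau_1}_{m_1}, \underbrace{\tau_2, \ldots, \tau_2}_{m_2}, \ldots, \underbrace{\tau_r, \ldots, \tau_r}_{m_r}]\bigr) = |\tau| \prod_{i=1}^{r} \gamma(\tau_i)^{m_i}, \quad\text{and}$$

$$\sigma(\bullet) = \gamma(\bullet) = 1.$$

Note that $\sigma(\tau)$ is the order of the group of automorphisms of the Butcher tree. 
Since Butcher trees are rooted trees, every automorphism of $\tau$ must map the root 
to the root. For $\tau = [\bullet, \bullet, \ldots, \bullet]$ with $p - 1$ entries (a wide shrub that is only one 
edge high with $p$ nodes), $\sigma(\tau) = (p - 1)!$ while $\gamma(\tau) = p$, giving $\sigma(\tau)\, \gamma(\tau) = p!$. 
On the other hand, for $\tau = [[\cdots[\bullet]\cdots]]$ with $p - 1$ nested brackets (which also has 
$p$ nodes but is a tall branchless tree), $\sigma(\tau) = 1$ and $\gamma(\tau) = p!$, again giving 
$\sigma(\tau)\, \gamma(\tau) = p!$. 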

On the other hand, applying the Runge-Kutta method (6.1.20, 6.1.21) gives 

$$x_{m+1} = x_m + \sum_{\tau:\, |\tau| \le p} \frac{h^{|\tau|}}{\sigma(\tau)}\, \Phi(\tau; A, b)\, \mathrm{expr}(\tau)\big|_{x = x_m} + O(h^{p+1}). \tag{6.1.30}$$

Here $\Phi(\tau; A, b)$ is an elementary weight in the sense of Butcher, and corresponds to 
the expressions in (6.1.28) such as $\Phi([[\bullet], \bullet]; A, b) = \sum_{i,j,k,p} b_j a_{ji} a_{jk} a_{ip}$. The 
general formula is 

$$\Phi(\tau; A, b) = \sum_{i_1, i_2, \ldots, i_p = 1}^{s} b_{i_1} \prod_{(j,k) \in E(\tau)} a_{i_j i_k} \tag{6.1.31}$$

where $E(\tau)$ is the set of edges of $\tau$ directed away from the root; nodes of $\tau$ are 
labeled $1, 2, 3, \ldots, p$ where $p = |\tau|$, with node 1 being the root of $\tau$. 

By matching the Taylor series expansions of the exact solution (6.1.29) and the 
Runge-Kutta method (6.1.30), we can see that the method has order $p$ if 

$$\Phi(\tau; A, b) = 1/\gamma(\tau) \qquad \text{for all Butcher trees } \tau \text{ with } |\tau| \le p. \tag{6.1.32}$$
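Condition (6.1.32) lends itself to a recursive implementation. In the sketch below a Butcher tree is a nested tuple, with `()` standing for $\bullet$ and `(t1, ..., tk)` for $[\tau_1, \ldots, \tau_k]$. This representation and all function names are assumptions for illustration. The elementary weight is evaluated stage by stage, using the fact that summing (6.1.31) over the labels of a subtree gives a vector $\psi(\tau)$ with $\psi(\bullet) = e$, $\psi([\tau_1, \ldots, \tau_k]) = (A\psi(\tau_1)) \circ \cdots \circ (A\psi(\tau_k))$, and $\Phi(\tau; A, b) = b^T \psi(\tau)$.

```python
from fractions import Fraction as F
from math import factorial

LEAF = ()          # the single-node tree; [t1, ..., tk] is the tuple (t1, ..., tk)


def size(tree):
    return 1 + sum(size(t) for t in tree)


def gamma(tree):
    g = size(tree)
    for t in tree:
        g *= gamma(t)
    return g


def canonical(tree):
    """Sort subtrees recursively so that permuted children compare equal."""
    return tuple(sorted(canonical(t) for t in tree))


def sigma(tree):
    children = [canonical(t) for t in tree]
    result = 1
    for child in set(children):
        m = children.count(child)              # multiplicity of this subtree
        result *= factorial(m) * sigma(child) ** m
    return result


def stage_weights(tree, A):
    """psi(tree): componentwise product over children of A @ psi(child)."""
    s = len(A)
    psi = [F(1)] * s
    for t in tree:
        child = stage_weights(t, A)
        a_child = [sum(A[j][i] * child[i] for i in range(s)) for j in range(s)]
        psi = [p * q for p, q in zip(psi, a_child)]
    return psi


def elementary_weight(tree, A, b):
    return sum(bj * pj for bj, pj in zip(b, stage_weights(tree, A)))


# All Butcher trees with at most four nodes.
TREES = [LEAF, (LEAF,), (LEAF, LEAF), ((LEAF,),),
         (LEAF, LEAF, LEAF), ((LEAF,), LEAF), ((LEAF, LEAF),), (((LEAF,),),)]

A = [[0, 0, 0, 0], [F(1, 2), 0, 0, 0], [0, F(1, 2), 0, 0], [0, 0, 1, 0]]
b = [F(1, 6), F(1, 3), F(1, 3), F(1, 6)]
order_4 = all(elementary_weight(t, A, b) == F(1, gamma(t)) for t in TREES)
print(order_4, [(sigma(t), gamma(t)) for t in TREES])
```

For the standard fourth-order Runge-Kutta tableau used here, every tree with at most four nodes satisfies (6.1.32).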


6.1.4.4 Butcher’s Simplifying Assumptions and Designing 
Runge-Kutta Methods 


Butcher’s condition (6.1.32) is a wonderful way of testing whether a given 
Runge-Kutta method has a given order of accuracy. However, it can be used as a design tool 
as well. Butcher gave simplifying assumptions that are sufficient, but not necessary, 
for obtaining a given order of accuracy. The conditions are still flexible enough 
to be used as the basis for creating new methods. In particular, these simplifying 
assumptions allow us to create high-order Runge-Kutta methods based on Gaussian 
quadrature methods (see Section 5.2). 

The simplifying assumptions are: 

$$B(p):\quad \sum_{i=1}^{s} b_i c_i^{k-1} = \frac{1}{k} \qquad \text{for } k = 1, 2, \ldots, p, \tag{6.1.33}$$

$$C(q):\quad \sum_{j=1}^{s} a_{ij} c_j^{k-1} = \frac{c_i^k}{k} \qquad \text{for } k = 1, 2, \ldots, q,\ \ i = 1, 2, \ldots, s, \tag{6.1.34}$$

$$D(r):\quad \sum_{i=1}^{s} b_i c_i^{k-1} a_{ij} = \frac{1}{k}\, b_j\, (1 - c_j^k) \qquad \text{for } k = 1, 2, \ldots, r,\ \ j = 1, 2, \ldots, s. \tag{6.1.35}$$

The most important use of these is to show a given order for a method satisfying 
these conditions. 

Theorem 6.8 If a Butcher tableau containing $A$, $b$ and $c$ with $c_i = \sum_{j=1}^{s} a_{ij}$ satisfies 
conditions $B(p)$, $C(q)$, and $D(r)$ with $p \le q + r + 1$ and $p \le 2q + 2$, then the 
method is of order $p$. 

Table 6.1.2 Gauss methods 

$$\text{(A) 1 stage; order is two: }
\begin{array}{c|c} \tfrac12 & \tfrac12\\ \hline & 1 \end{array}$$

$$\text{(B) 2 stage; order is four: }
\begin{array}{c|cc}
(3 - \sqrt{3})/6 & 1/4 & (3 - 2\sqrt{3})/12\\
(3 + \sqrt{3})/6 & (3 + 2\sqrt{3})/12 & 1/4\\
\hline
 & 1/2 & 1/2
\end{array}$$

$$\text{(C) 3 stage; order is six: }
\begin{array}{c|ccc}
(5 - \sqrt{15})/10 & 5/36 & 2/9 - \sqrt{15}/15 & 5/36 - \sqrt{15}/30\\
1/2 & 5/36 + \sqrt{15}/24 & 2/9 & 5/36 - \sqrt{15}/24\\
(5 + \sqrt{15})/10 & 5/36 + \sqrt{15}/30 & 2/9 + \sqrt{15}/15 & 5/36\\
\hline
 & 5/18 & 4/9 & 5/18
\end{array}$$


The proof uses the manipulation of Butcher trees and the order conditions (6.1.32). 
Condition $B(p)$ is necessary to obtain a method of order $p$ as $B(p)$ is necessary to 
integrate $f(t, x) = f(t)$ with order $p$ accuracy. Condition $C(q)$ ensures that the 
integration formula 

$$\int_{t_m}^{t_m + c_i h} f(t)\,dt \approx h\sum_{j=1}^{s} a_{ij}\, f(t_m + c_j h)$$

is exact for all polynomials of degree $\le q - 1$. A proof of Theorem 6.8 is given in 
[45, Thm. 7]. Just as important for applications, Butcher showed that: 

Theorem 6.9 Conditions $B(p + q)$ and $C(q)$ imply $D(p)$, provided the $c_i$’s are 
distinct. 
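The simplifying assumptions are easy to test numerically. The sketch below evaluates the residuals of $B(p)$, $C(q)$ and $D(r)$ in floating point for the two-stage Gauss tableau of Table 6.1.2; the residual functions and their names are assumptions for illustration.

```python
import math

r3 = math.sqrt(3.0)
# Two-stage Gauss method (Table 6.1.2(B)).
c = [(3 - r3) / 6, (3 + r3) / 6]
b = [0.5, 0.5]
A = [[0.25, (3 - 2 * r3) / 12],
     [(3 + 2 * r3) / 12, 0.25]]
s = 2


def residual_B(p):
    """Largest violation of B(p), (6.1.33)."""
    return max(abs(sum(b[i] * c[i] ** (k - 1) for i in range(s)) - 1 / k)
               for k in range(1, p + 1))


def residual_C(q):
    """Largest violation of C(q), (6.1.34)."""
    return max(abs(sum(A[i][j] * c[j] ** (k - 1) for j in range(s)) - c[i] ** k / k)
               for k in range(1, q + 1) for i in range(s))


def residual_D(r):
    """Largest violation of D(r), (6.1.35)."""
    return max(abs(sum(b[i] * c[i] ** (k - 1) * A[i][j] for i in range(s))
                   - b[j] * (1 - c[j] ** k) / k)
               for k in range(1, r + 1) for j in range(s))


print(residual_B(4), residual_C(2), residual_D(2))   # all at rounding level
print(residual_B(5))                                 # B(5) fails
```

Here $B(4)$, $C(2)$ and $D(2)$ hold to rounding error, so Theorem 6.8 with $p = 4$, $q = r = 2$ gives order four, while $B(5)$ fails, so the order cannot be five.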


The simplifying assumptions can be used to create methods that have high 
order. For example, suppose we start with the Gauss-Legendre integration method 
$\int_{t_m}^{t_m + h} f(t)\,dt \approx h\sum_{j=1}^{s} b_j\, f(t_m + c_j h)$ (see Section 5.2) that is exact for all 
polynomials of degree $< 2s$ where $s$ is the number of function evaluations. The values $c_j$ are 
the roots of the Legendre polynomial of degree $s$ shifted to $[0, 1]$; that is, 
$(d/dx)^s\,[x^s (x - 1)^s] = 0$ at $x = c_j$. This choice of $c_j$’s and $b_j$’s satisfies $B(2s)$. To 
determine the coefficients $a_{ij}$ we need $s^2$ equations. Satisfying condition $C(s)$ gives those 
additional $s^2$ equations. By Theorem 6.9, condition $D(s)$ also holds. By Theorem 6.8, the 
resulting method has order $2s$. These are the Gauss methods or Gauss-Kuntzmann methods. 
The Gauss methods of one, two and three stages are shown in Table 6.1.2. Note that 
the one-stage Gauss method is the implicit mid-point rule. 

Table 6.1.3 Radau IIA methods 

$$\text{(A) 2 stages; order is three: }
\begin{array}{c|cc}
1/3 & 5/12 & -1/12\\
1 & 3/4 & 1/4\\
\hline
 & 3/4 & 1/4
\end{array}$$

$$\text{(B) 3 stages; order is five: }
\begin{array}{c|ccc}
(4 - \sqrt{6})/10 & (88 - 7\sqrt{6})/360 & (296 - 169\sqrt{6})/1800 & (-2 + 3\sqrt{6})/225\\
(4 + \sqrt{6})/10 & (296 + 169\sqrt{6})/1800 & (88 + 7\sqrt{6})/360 & (-2 - 3\sqrt{6})/225\\
1 & (16 - \sqrt{6})/36 & (16 + \sqrt{6})/36 & 1/9\\
\hline
 & (16 - \sqrt{6})/36 & (16 + \sqrt{6})/36 & 1/9
\end{array}$$


Another important family of methods is the Radau methods, particularly 
the Radau IIA methods. The Radau IIA methods take the $c_j$’s to be the zeros of 

$$\frac{d^{s-1}}{dx^{s-1}}\bigl[x^{s-1}(1 - x)^s\bigr].$$

This gives an integration method of order $2s - 1$, but we have $c_s = 1$. Radau methods 
satisfy simplifying assumption $B(2s - 1)$; the values of $a_{ij}$ are determined by 
satisfying $C(s)$, giving the necessary $s^2$ equations. Theorem 6.9 then implies $D(s - 1)$ 
also holds, and so by Theorem 6.8, the $s$-stage Radau IIA method has order $2s - 1$. 
The one-stage Radau IIA method is the implicit or backward Euler method. The 
Radau IIA methods of two and three stages are shown in Table 6.1.3. Note that the 
last line of the tableau ($b^T$) is the same as the second last line of the tableau ($e_s^T A$). 
This is not a random fact; it is important for some properties of these methods, as is 
discussed in Section 6.1.6. 

Another important family of methods is the diagonally implicit Runge-Kutta 
methods or DIRK methods. In DIRK methods, the matrix $A$ is lower triangular, 
but with non-zero entries on the diagonal. Typically, DIRK methods have the same 
non-zero value in each diagonal entry. These methods have the benefit that, when 
solving the Runge-Kutta equations (6.1.20) for $x(t) \in \mathbb{R}^n$, a series of $s$ $n$-dimensional 
nonlinear equations can be solved in turn, while for a general implicit Runge-Kutta 
method, one $sn$-dimensional nonlinear system of equations must be solved. Even if 
these equations are linear (or we use the Newton method), we expect that solving one 
$n \times n$ system of equations takes about $1/s^3$ times as much computational work as 
solving one $(sn) \times (sn)$ system. So solving $s$ systems each $n \times n$ takes about $1/s^2$ 
times as much computational work. There are many DIRK methods, but here we 
highlight two of them, both three stage methods. The method of Alexander [3] gives 
third order accuracy, while the method of Crouzeix & Raviart [62] gives fourth order 
accuracy. These methods are shown in Table 6.1.4. 

Table 6.1.4 DIRK methods of Alexander and Crouzeix & Raviart 

$$\text{(A) Alexander's DIRK method: }
\begin{array}{c|ccc}
\alpha & \alpha & & \\
\tau_2 & \tau_2 - \alpha & \alpha & \\
1 & b_1 & b_2 & \alpha\\
\hline
 & b_1 & b_2 & \alpha
\end{array}$$

The values of the parameters are given by 

$$\alpha = \text{root of } x^3 - 3x^2 + \tfrac{3}{2}x - \tfrac{1}{6} \text{ in } \bigl(\tfrac{1}{6}, \tfrac{1}{2}\bigr), \qquad \alpha \approx 0.43586652150846,$$

$$\tau_2 = \tfrac{1}{2}(1 + \alpha), \qquad
b_1 = -\tfrac{1}{4}(6\alpha^2 - 16\alpha + 1), \qquad
b_2 = +\tfrac{1}{4}(6\alpha^2 - 20\alpha + 5).$$

$$\text{(B) Method of Crouzeix \& Raviart: }
\begin{array}{c|ccc}
\gamma & \gamma & & \\
1/2 & 1/2 - \gamma & \gamma & \\
1 - \gamma & 2\gamma & 1 - 4\gamma & \gamma\\
\hline
 & \delta & 1 - 2\delta & \delta
\end{array}$$

The values of the parameters are given by 

$$\gamma = \frac{1}{\sqrt{3}}\cos\Bigl(\frac{\pi}{18}\Bigr) + \frac{1}{2}, \qquad
\delta = \frac{1}{6\,(2\gamma - 1)^2}.$$


6.1.5  Multistep Methods 


Multistep methods use previous solution values $x_k, x_{k-1}, \ldots, x_{k-m}$ and 
interpolation to compute $x_{k+1}$. The earliest of these methods are the Adams methods, named 
after John Couch Adams, who published the methods as an appendix to a paper on 
capillary action of drops of water by Francis Bashforth [17] in 1883. These methods 
are split into explicit methods (the Adams-Bashforth methods) and implicit methods 
(the Adams-Moulton methods). The Adams-Moulton methods had the name “Moulton” 
attached because of Forest Ray Moulton’s book [183] in which he showed that an 
explicit method can provide a starting value for the corresponding implicit method. 

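All Adams methods are based on interpolation of $f(t, x(t))$. The Adams-Bashforth 
methods use the polynomial interpolant $p_{m,k}(t)$ of $f(t_k, x_k)$, $f(t_{k-1}, x_{k-1})$, 
$\ldots$, $f(t_{k-m}, x_{k-m})$ to compute 

$$x_{k+1} = x_k + \int_{t_k}^{t_{k+1}} p_{m,k}(t)\,dt \approx x_k + \int_{t_k}^{t_{k+1}} f(t, x(t))\,dt.$$

Assuming that the time instances $t_{k-j}$ are equally spaced, $t_{k-j} = t_k - jh$, we can 
write $p_{m,k}$ as a linear combination of Lagrange interpolation polynomials (4.1.3) 

$$p_{m,k}(t) = \sum_{j=0}^{m} f(t_{k-j}, x_{k-j})\, L_j(t) \qquad \text{where } \deg L_j = m \text{ and}$$

$$L_j(t_{k-\ell}) = \begin{cases} 1, & \text{if } j = \ell,\\ 0, & \text{if } j \ne \ell \text{ and } j = 0, 1, 2, \ldots, m. \end{cases}$$

We can write 

$$L_j(t) = \widehat{L}_j\Bigl(\frac{t - t_k}{h}\Bigr) \qquad \text{where } \deg \widehat{L}_j = m \text{ and}$$

$$\widehat{L}_j(-\ell) = \begin{cases} 1, & \text{if } j = \ell,\\ 0, & \text{if } j \ne \ell \text{ and } j = 0, 1, 2, \ldots, m. \end{cases}$$

These $\widehat{L}_j$ do not depend on $h$. Then 

$$\begin{aligned}
x_{k+1} &= x_k + \int_{t_k}^{t_{k+1}} p_{m,k}(t)\,dt\\
&= x_k + \int_{t_k}^{t_{k+1}} \sum_{j=0}^{m} f(t_{k-j}, x_{k-j})\, L_j(t)\,dt\\
&= x_k + \sum_{j=0}^{m} f(t_{k-j}, x_{k-j}) \int_{t_k}^{t_{k+1}} L_j(t)\,dt\\
&= x_k + h\sum_{j=0}^{m} f(t_{k-j}, x_{k-j}) \int_0^1 \widehat{L}_j(s)\,ds.
\end{aligned} \tag{6.1.36}$$

The values of $\beta_j := \int_0^1 \widehat{L}_j(s)\,ds$ for $m = 0, 1, 2, 3, 4$ are shown in Table 6.1.5. 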
The Adams-Moulton methods, on the other hand, are implicit, and use 
interpolation at $t = t_{k+1}, t_k, \ldots, t_{k-m+1}$: 

$$q_{m,k}(t) = \sum_{j=-1}^{m-1} f(t_{k-j}, x_{k-j})\, M_j(t) \qquad \text{where } \deg M_j = m \text{ and}$$

$$M_j(t_{k-\ell}) = \begin{cases} 1, & \text{if } j = \ell,\\ 0, & \text{if } j \ne \ell \text{ and } j = -1, 0, 1, \ldots, m - 1. \end{cases} \tag{6.1.37}$$

This interpolation uses only one future value $f(t_{k+1}, x_{k+1})$, which directly involves 
the unknown $x_{k+1}$ to be computed. 

Assuming equally spaced interpolation points with spacing $h$, the Lagrange 
interpolation polynomials are $M_j(t) = \widehat{M}_j((t - t_k)/h)$ where $\deg \widehat{M}_j = m$ and 

$$\widehat{M}_j(-\ell) = \begin{cases} 1, & \text{if } j = \ell,\\ 0, & \text{if } j \ne \ell \text{ and } j = -1, 0, 1, \ldots, m - 1. \end{cases}$$

Note that $\widehat{M}_j$ does not depend on $h$. This leads to the Adams-Moulton method 

$$\begin{aligned}
x_{k+1} &= x_k + \int_{t_k}^{t_{k+1}} \sum_{j=-1}^{m-1} f(t_{k-j}, x_{k-j})\, M_j(t)\,dt\\
&= x_k + h\sum_{j=-1}^{m-1} f(t_{k-j}, x_{k-j}) \int_0^1 \widehat{M}_j(s)\,ds.
\end{aligned}$$

Values of $\gamma_j := \int_0^1 \widehat{M}_j(s)\,ds$ are also shown in Table 6.1.5. 

The Adams-Bashforth method with $m = 0$ is Euler’s method. The 
Adams-Moulton method with $m = 1$ is the implicit trapezoidal rule. 
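The Adams coefficients can be computed directly from their definitions as integrals of Lagrange basis polynomials over $[0, 1]$ in the scaled variable $s = (t - t_k)/h$. The following sketch does so in exact rational arithmetic; the polynomial helpers and function names are assumptions made for illustration.

```python
from fractions import Fraction as F


def poly_mul(p, q):
    """Multiply polynomials given as coefficient lists (lowest degree first)."""
    r = [F(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r


def lagrange_basis(nodes, j):
    """Coefficients of the polynomial equal to 1 at nodes[j] and 0 at the other nodes."""
    p = [F(1)]
    for i, node in enumerate(nodes):
        if i != j:
            scale = F(nodes[j] - node)
            # multiply by the linear factor (s - node) / (nodes[j] - node)
            p = poly_mul(p, [F(-node) / scale, F(1) / scale])
    return p


def integral_0_1(p):
    """Integral over [0, 1] of a polynomial given by its coefficients."""
    return sum(coef / (k + 1) for k, coef in enumerate(p))


def adams_bashforth(m):
    """beta_0, ..., beta_m: interpolation nodes s = 0, -1, ..., -m."""
    nodes = [-j for j in range(0, m + 1)]
    return [integral_0_1(lagrange_basis(nodes, i)) for i in range(len(nodes))]


def adams_moulton(m):
    """gamma_{-1}, ..., gamma_{m-1}: interpolation nodes s = 1, 0, ..., -(m - 1)."""
    nodes = [-j for j in range(-1, m)]
    return [integral_0_1(lagrange_basis(nodes, i)) for i in range(len(nodes))]


print(adams_bashforth(1), adams_moulton(1))
print(adams_bashforth(3))
```

With $m = 1$ this gives $\beta = (3/2, -1/2)$ and $\gamma = (1/2, 1/2)$, the latter being the implicit trapezoidal rule.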

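Another family of multistep methods is the backward differentiation formulas or 
BDF methods. These rely on an interpolant of the solution values: $p_{m,k}(t_{k-j}) = x_{k-j}$ 
for $j = -1, 0, 1, \ldots, m - 1$. We can then write 

$$p_{m,k}(t) = \sum_{j=-1}^{m-1} x_{k-j}\, M_j(t) \tag{6.1.38}$$

where the Lagrange interpolation polynomials $M_j$ are given by (6.1.37). The equation 
to be satisfied is then $p_{m,k}'(t_{k+1}) = f(t_{k+1}, x_{k+1})$. This implicitly defines the equation 
for $x_{k+1}$ given $x_{k-j}$ for $j = 0, 1, \ldots, m - 1$. That is, 

$$\sum_{j=-1}^{m-1} x_{k-j}\, M_j'(t_{k+1}) = f(t_{k+1}, x_{k+1}).$$

Writing $M_j(t) = \widehat{M}_j((t - t_k)/h)$ we see that this method can be written as 

$$x_{k+1} + \sum_{j=0}^{m-1} \frac{\widehat{M}_j'(1)}{\widehat{M}_{-1}'(1)}\, x_{k-j} = \frac{h}{\widehat{M}_{-1}'(1)}\, f(t_{k+1}, x_{k+1}), \quad\text{or equivalently,}$$

$$x_{k+1} = \sum_{j=0}^{m-1} \alpha_j\, x_{k-j} + h\,\beta\, f(t_{k+1}, x_{k+1}), \tag{6.1.39}$$

with $\alpha_j = -\widehat{M}_j'(1)/\widehat{M}_{-1}'(1)$ and $\beta = 1/\widehat{M}_{-1}'(1)$. 

Table 6.1.6 shows the values of $\beta$ and the coefficients $\alpha_j$ for different values of $m$. 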
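The BDF coefficients follow from the derivatives of the Lagrange basis polynomials at the new time point. The sketch below computes them exactly, assuming the normalization $x_{k+1} = \sum_j \alpha_j x_{k-j} + h \beta f(t_{k+1}, x_{k+1})$; the sign convention for $\alpha_j$ is an assumption, since other texts move the sum to the left-hand side. The function name is illustrative.

```python
from fractions import Fraction as F


def bdf_coefficients(m):
    """Return (beta, [alpha_0, ..., alpha_{m-1}]) for
    x_{k+1} = sum_j alpha_j x_{k-j} + h * beta * f(t_{k+1}, x_{k+1})."""
    # Scaled nodes s = (t - t_k)/h; nodes[0] = 1 corresponds to t_{k+1}.
    nodes = [F(-j) for j in range(-1, m)]
    one = nodes[0]
    # Derivative at s = 1 of the basis polynomial attached to the node s = 1.
    d_new = sum(1 / (one - node) for node in nodes[1:])
    alphas = []
    for j in range(1, len(nodes)):
        # Derivative at s = 1 of the basis polynomial attached to nodes[j]:
        # prod_{i != 0, j} (1 - s_i) / prod_{i != j} (s_j - s_i).
        num = F(1)
        for i in range(1, len(nodes)):
            if i != j:
                num *= one - nodes[i]
        den = F(1)
        for i in range(len(nodes)):
            if i != j:
                den *= nodes[j] - nodes[i]
        alphas.append(-(num / den) / d_new)
    return 1 / d_new, alphas


for m in range(1, 7):
    print(m, bdf_coefficients(m))
```

For $m = 1$ this reproduces the implicit Euler method ($\beta = 1$, $\alpha_0 = 1$), and for $m = 2$ it gives $\beta = 2/3$, $\alpha_0 = 4/3$, $\alpha_1 = -1/3$.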

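Note that there are no usable BDF methods beyond these six. Using $m = 7$ or 
higher results in a method that is unstable, even for $f(t, x) = 0$ for all $(t, x)$. Every 
BDF method is implicit. 

Table 6.1.5 Adams-Bashforth and Adams-Moulton coefficients 

$$\text{(A) Adams-Bashforth, } \beta_j: \qquad
\begin{array}{c|cccccc}
m & j = 0 & 1 & 2 & 3 & 4 & 5\\
\hline
0 & 1 & & & & & \\
1 & \frac{3}{2} & -\frac{1}{2} & & & & \\
2 & \frac{23}{12} & -\frac{4}{3} & \frac{5}{12} & & & \\
3 & \frac{55}{24} & -\frac{59}{24} & \frac{37}{24} & -\frac{3}{8} & & \\
4 & \frac{1901}{720} & -\frac{1387}{360} & \frac{109}{30} & -\frac{637}{360} & \frac{251}{720} & \\
5 & \frac{4277}{1440} & -\frac{2641}{480} & \frac{4991}{720} & -\frac{3649}{720} & \frac{959}{480} & -\frac{95}{288}
\end{array}$$

$$\text{(B) Adams-Moulton, } \gamma_j: \qquad
\begin{array}{c|cccccc}
m & j = -1 & 0 & 1 & 2 & 3 & 4\\
\hline
0 & 1 & & & & & \\
1 & \frac{1}{2} & \frac{1}{2} & & & & \\
2 & \frac{5}{12} & \frac{2}{3} & -\frac{1}{12} & & & \\
3 & \frac{3}{8} & \frac{19}{24} & -\frac{5}{24} & \frac{1}{24} & & \\
4 & \frac{251}{720} & \frac{323}{360} & -\frac{11}{30} & \frac{53}{360} & -\frac{19}{720} & \\
5 & \frac{95}{288} & \frac{1427}{1440} & -\frac{133}{240} & \frac{241}{720} & -\frac{173}{1440} & \frac{3}{160}
\end{array}$$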


6.1.6 Stability and Implicit Methods 


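Stability is an important concept in differential equations and their numerical 
solution. If we consider the basic differential equation 

$$\frac{dx}{dt} = \lambda x, \qquad \text{the solution is} \qquad x(t) = x(t_0)\, e^{\lambda (t - t_0)}.$$

The solution $x(t) \to 0$ as $t \to +\infty$ if and only if the real part of $\lambda$ satisfies $\mathrm{Re}\,\lambda < 0$; 
$|x(t)| \to \infty$ as $t \to \infty$ if and only if $\mathrm{Re}\,\lambda > 0$. If $\mathrm{Re}\,\lambda = 0$, then $|x(t)|$ is constant. 

Table 6.1.6 BDF formulas, in the form $x_{k+1} = \sum_{j=0}^{m-1} \alpha_j x_{k-j} + h\beta f(t_{k+1}, x_{k+1})$ 

$$\begin{array}{c|c|cccccc}
m & \beta & \alpha_0 & \alpha_1 & \alpha_2 & \alpha_3 & \alpha_4 & \alpha_5\\
\hline
1 & 1 & 1 & & & & & \\
2 & \frac{2}{3} & \frac{4}{3} & -\frac{1}{3} & & & & \\
3 & \frac{6}{11} & \frac{18}{11} & -\frac{9}{11} & \frac{2}{11} & & & \\
4 & \frac{12}{25} & \frac{48}{25} & -\frac{36}{25} & \frac{16}{25} & -\frac{3}{25} & & \\
5 & \frac{60}{137} & \frac{300}{137} & -\frac{300}{137} & \frac{200}{137} & -\frac{75}{137} & \frac{12}{137} & \\
6 & \frac{60}{147} & \frac{360}{147} & -\frac{450}{147} & \frac{400}{147} & -\frac{225}{147} & \frac{72}{147} & -\frac{10}{147}
\end{array}$$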

We would like the stability of our numerical methods to match, as closely as is 
reasonably possible, the stability of the differential equation being solved. If the step 
size $h$ is small, then because of the convergence of the method we expect the 
behavior of the method’s results to closely match the behavior of the exact solution. 
The difficulty becomes acute when the value of $L h$ is large, where $L$ is the Lipschitz 
constant of $f(t, x)$, but $h$ is modestly small. Normally we might expect that the 
solution would show “interesting behavior” over a time-scale of $1/L$, so it would be 
natural to make $L h$ small in order to capture this “interesting behavior”. 

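But there are many systems of differential equations where we might indeed want 
to make $L h$ large. Consider, for example, the diffusion equation 

$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + f(t, x, y, u(t, x, y)) \qquad \text{inside a region } \Omega, \tag{6.1.40}$$

$$u = 0 \qquad \text{on the boundary } \partial\Omega. \tag{6.1.41}$$

This is a non-linear partial differential equation. Using the five-point stencil 
approximation (2.4.7) for $\partial^2 u/\partial x^2 + \partial^2 u/\partial y^2$ we get the discrete version of the diffusion 
equation: 

$$\frac{du_{ij}}{dt} = \frac{u_{i+1,j} + u_{i,j+1} - 4u_{ij} + u_{i-1,j} + u_{i,j-1}}{(\Delta x)^2} + f(t, x_i, y_j, u_{ij}) \qquad \text{if } (x_i, y_j) \in \Omega, \tag{6.1.42}$$

$$u_{ij} = 0 \qquad \text{if } (x_i, y_j) \notin \Omega. \tag{6.1.43}$$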
By combining all the values $u_{ij}$ for $(x_i, y_j) \in \Omega$ into a single vector $u$, we can write 
this as a single differential equation 

$$\frac{du}{dt} = B_{\Delta x}\, u + f(t, u).$$

The Lipschitz constant of the right-hand side is bounded by $\|B_{\Delta x}\| + L_f$ where $L_f$ is 
the Lipschitz constant of $f$. Since $\|B_{\Delta x}\|_2 \approx 8/(\Delta x)^2$ and $L_f$ is usually modest, for 
small $\Delta x$ the Lipschitz constant of the right-hand side is dominated by $\|B_{\Delta x}\|$. The 
good news is that the eigenvalues of $B_{\Delta x}$ are negative real. This makes the differential 
equation (6.1.42, 6.1.43) quite stable. But the numerical method has to deal with the 
large Lipschitz constant. The large negative eigenvalues of $B_{\Delta x}$ correspond to rapid 
spatial oscillation that is quickly damped out in time. But these large eigenvalues 
can cause difficulties for our numerical methods. Take, for example, Euler’s method 
with $f(t, u) = 0$: 

$$u^{k+1} = u^k + h\, B_{\Delta x}\, u^k = (I + h\, B_{\Delta x})\, u^k.$$

The eigenvalues of $I + h\, B_{\Delta x}$ are $1 + h\lambda$ where $\lambda$ is an eigenvalue of $B_{\Delta x}$. If $h\lambda < 
-2$ then $|1 + h\lambda| > 1$, meaning that the method has become unstable. Since the 
minimum eigenvalue is $\approx -8/(\Delta x)^2$, the time step $h$ should be in the range 
$0 < h < (\Delta x)^2/4$. 

On the other hand, if we use the implicit Euler method, we get 

$$u^{k+1} = u^k + h\, B_{\Delta x}\, u^{k+1}, \qquad \text{so} \qquad u^{k+1} = (I - h\, B_{\Delta x})^{-1} u^k.$$

The eigenvalues of $(I - h\, B_{\Delta x})^{-1}$ are $1/(1 - h\lambda)$ where $\lambda$ is an eigenvalue of 
$B_{\Delta x}$. Since the eigenvalues of $B_{\Delta x}$ are negative, $0 < 1/(1 - h\lambda) < 1$. This method 
(implicit Euler) is absolutely stable; that is, provided $\mathrm{Re}\,\lambda < 0$, the method applied 
to $dx/dt = \lambda x$ is stable. 
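The step-size restriction for Euler's method can be illustrated without forming $B_{\Delta x}$. The sketch below assumes that $\Omega$ is the unit square with an $n \times n$ grid of interior points; this choice of domain is an assumption, not something the text specifies. In that case the eigenvalues of the five-point Laplacian are sums of pairs of eigenvalues $-(4/(\Delta x)^2)\sin^2\bigl(j\pi/(2(n+1))\bigr)$ of the one-dimensional second-difference matrix. Function names are illustrative.

```python
import math


def laplacian_eigenvalues(n, dx):
    """Eigenvalues of the five-point Laplacian on an n-by-n interior grid
    of a square with zero boundary values (assumed domain)."""
    one_d = [-(4.0 / dx**2) * math.sin(j * math.pi / (2 * (n + 1))) ** 2
             for j in range(1, n + 1)]
    return [a + b for a in one_d for b in one_d]


def worst_factor_explicit(eigs, h):
    """Largest |1 + h*lambda|: amplification per step of Euler's method."""
    return max(abs(1 + h * lam) for lam in eigs)


def worst_factor_implicit(eigs, h):
    """Largest |1/(1 - h*lambda)|: amplification per step of implicit Euler."""
    return max(abs(1 / (1 - h * lam)) for lam in eigs)


n = 20
dx = 1.0 / (n + 1)
eigs = laplacian_eigenvalues(n, dx)

h_ok = 0.9 * dx**2 / 4       # inside the range 0 < h < dx^2/4
h_bad = 1.5 * dx**2 / 4      # outside that range

print("min eigenvalue * dx^2:", min(eigs) * dx**2)
print("Euler, h_ok: ", worst_factor_explicit(eigs, h_ok))
print("Euler, h_bad:", worst_factor_explicit(eigs, h_bad))
print("implicit Euler, 100*h_bad:", worst_factor_implicit(eigs, 100 * h_bad))
```

The explicit factor exceeds one as soon as $h$ leaves the range $0 < h < (\Delta x)^2/4$, while the implicit factor stays below one even for a step one hundred times larger.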

Consider the test equation $dx/dt = \lambda x$; then $x(t_{k+1}) = e^{h\lambda} x(t_k)$. For a numerical 
method we wish to find $x_{k+1} \approx x(t_{k+1})$ in terms of $x_k \approx x(t_k)$. Provided the method 
is linear, we can write $x_{k+1} = R(h, \lambda)\, x_k$. The function $R(h, \lambda)$ is called the stability 
function; for all the methods we will consider, $R(h, \lambda) = R(h\lambda)$. 


6.1.6.1 Stability of Runge-Kutta Methods 
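To find the stability function for (6.1.20, 6.1.21) we substitute $f(t, x) = \lambda x$ to get 

$$v_{k,j} = \lambda\Bigl(x_k + h\sum_{i=1}^{s} a_{ji}\, v_{k,i}\Bigr), \qquad j = 1, 2, \ldots, s,$$

$$x_{k+1} = x_k + h\sum_{j=1}^{s} b_j\, v_{k,j}.$$

Expressing these equations in matrix-vector form we get 

$$v_k = \lambda\, e\, x_k + \lambda h\, A\, v_k, \qquad x_{k+1} = x_k + h\, b^T v_k,$$

where $e = [1, 1, \ldots, 1]^T$. Solving these gives $(I - h\lambda A)\, v_k = \lambda\, e\, x_k$ and so 

$$\begin{aligned}
x_{k+1} &= x_k + h\, b^T (I - h\lambda A)^{-1} \lambda\, e\, x_k\\
&= \bigl[1 + h\lambda\, b^T (I - h\lambda A)^{-1} e\bigr]\, x_k\\
&= R(h\lambda)\, x_k \qquad \text{where}
\end{aligned}$$

$$R(h\lambda) = 1 + h\lambda\, b^T (I - h\lambda A)^{-1} e.$$

Table 6.1.7 Stability functions for selected Runge-Kutta methods 

$$\begin{array}{ll}
\text{Method} & R(z)\\
\hline
\text{Euler} & 1 + z\\[4pt]
\text{Implicit Euler} & \dfrac{1}{1 - z}\\[8pt]
\text{Heun} & 1 + z + \tfrac{1}{2}z^2\\[4pt]
\text{Trapezoidal} & \dfrac{1 + \frac{1}{2}z}{1 - \frac{1}{2}z}\\[8pt]
\text{Mid-point} & \dfrac{1 + \frac{1}{2}z}{1 - \frac{1}{2}z}\\[8pt]
\text{4-stage Runge-Kutta} & 1 + z + \tfrac{1}{2}z^2 + \tfrac{1}{6}z^3 + \tfrac{1}{24}z^4\\[4pt]
\text{Gauss 2-stage} & \dfrac{1 + \frac{1}{2}z + \frac{1}{12}z^2}{1 - \frac{1}{2}z + \frac{1}{12}z^2}\\[8pt]
\text{Radau IIA 2-stage} & \dfrac{1 + \frac{1}{3}z}{1 - \frac{2}{3}z + \frac{1}{6}z^2}\\[8pt]
\text{Alexander} & \dfrac{1 + (1 - 3\alpha)z + \frac{1}{2}(6\alpha^2 - 6\alpha + 1)z^2}{(1 - \alpha z)^3}\\[8pt]
\text{Crouzeix \& Raviart} & \dfrac{1 + (1 - 3\gamma)z + \frac{1}{2}(6\gamma^2 - 6\gamma + 1)z^2 + \frac{1}{6}(-6\gamma^3 + 18\gamma^2 - 9\gamma + 1)z^3}{(1 - \gamma z)^3}
\end{array}$$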
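The formula $R(z) = 1 + z\, b^T (I - zA)^{-1} e$ can be evaluated numerically for any tableau. The sketch below uses a small Gaussian elimination routine so that it needs only the standard library; the routine and the function names are assumptions for illustration.

```python
def solve(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting
    (complex entries allowed)."""
    n = len(rhs)
    aug = [list(map(complex, row)) + [complex(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    y = [0j] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][k] * y[k] for k in range(r + 1, n))
        y[r] = (aug[r][n] - tail) / aug[r][r]
    return y


def stability_function(A, b, z):
    """R(z) = 1 + z b^T (I - z A)^{-1} e for the tableau (A, b)."""
    s = len(b)
    M = [[(1 if i == j else 0) - z * A[i][j] for j in range(s)] for i in range(s)]
    y = solve(M, [1] * s)                  # y = (I - z A)^{-1} e
    return 1 + z * sum(bj * yj for bj, yj in zip(b, y))


TRAPEZOIDAL = ([[0, 0], [0.5, 0.5]], [0.5, 0.5])
RADAU_IIA_2 = ([[5 / 12, -1 / 12], [3 / 4, 1 / 4]], [3 / 4, 1 / 4])
RK4 = ([[0, 0, 0, 0], [0.5, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 1, 0]],
       [1 / 6, 1 / 3, 1 / 3, 1 / 6])

z = complex(-1.0, 2.0)                     # an arbitrary sample point
print(stability_function(*TRAPEZOIDAL, z), (1 + z / 2) / (1 - z / 2))
print(stability_function(*RADAU_IIA_2, z), (1 + z / 3) / (1 - 2 * z / 3 + z * z / 6))
print(stability_function(*RK4, z), 1 + z + z**2 / 2 + z**3 / 6 + z**4 / 24)
```

Each pair of printed values should agree to rounding error with the corresponding closed form for the trapezoidal, two-stage Radau IIA, and four-stage Runge-Kutta methods.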


Stability functions $R(z)$ for $z = h\lambda$ for various methods are listed in Table 6.1.7. 
It should be noted that for a method of order $p$, $R(z) = e^z + O(z^{p+1})$ as $|z| \to 0$. 
If a method is explicit, then $R(z)$ is a polynomial of degree at most the 
number of stages $s$; in general, $R(z)$ is a rational function of $z$. If $A$ is an invertible 
matrix, then $R(z) \to 1 - b^T A^{-1} e$ as $|z| \to \infty$. In this case, if $1 - b^T A^{-1} e \ne 0$, 
then the degrees of the numerator and denominator of $R(z)$ are the same. Otherwise, 
if $1 - b^T A^{-1} e = 0$, the degree of the numerator of $R(z)$ is less than the degree of 
the denominator of $R(z)$. 

The stability function $R(z)$ can often be related to Padé approximations (see 
Exercise 4.6.10). A Padé approximation of a function $f$ is a rational function $r(z) = 
n(z)/d(z)$ where $f^{(k)}(0) = r^{(k)}(0)$ for $k = 0, 1, 2, \ldots, \deg n + \deg d$. The order 
of the approximation is $m = \deg n + \deg d + 1$. We call $r(z)$ a $(\deg n, \deg d)$ Padé 
approximation. Then $R(z)$ for the trapezoidal and mid-point rules are both the (1, 1) 
Padé approximations to $e^z$, while $R(z)$ for the 2-stage Gauss method is a (2, 2) 
Padé approximation to $e^z$, while $R(z)$ for the 2-stage Radau IIA method is a (1, 2) Padé 
approximation to $e^z$. In fact, for linear equations $dx/dt = Bx$, we have $x_{k+1} = 
R(hB)\, x_k$, so finding a suitable method can be reduced to the problem of finding a 
Padé approximation $R(z) \approx e^z$ for $z \approx 0$ with suitable stability properties [28]. 

Fig. 6.1.5 Stability regions of Runge-Kutta methods: (A) explicit and trapezoidal methods; 
(B) Radau IIA ($s = 2$) and DIRK methods. In (A) the stability regions are inside the 
curves; in (B) they are outside the curves 

The stability region is the set $\{z \in \mathbb{C} : |R(z)| < 1\}$. Figure 6.1.5 shows the 
stability regions for the Euler, Heun, standard fourth-order Runge-Kutta, trapezoidal, and 
2-stage Radau IIA methods, as well as the DIRK methods of Alexander, and of Crouzeix 
& Raviart. Note that the stability regions of the trapezoidal, mid-point, and 2-stage 
Gauss methods are all exactly the left-half complex plane. 

There are several stability conditions that can be easily checked from the stability 
function: a method is called A-stable if the stability region includes all the left-half 
complex plane; that is, $|R(z)| < 1$ for all $z \in \mathbb{C}$ where $\mathrm{Re}\, z < 0$. This is a stronger 
condition than is needed to ensure the method is stable for the diffusion equation. 
Even stronger is the condition of being L-stable: the method must be A-stable and 
$\lim_{|z| \to \infty} R(z) = 0$. Gauss methods, Radau IIA methods, and the DIRK methods of 
Alexander, and of Crouzeix & Raviart, are all A-stable. Of these, the Radau IIA and 
the DIRK methods are L-stable. 

There is a stability condition for nonlinear equations called B-stability. A method
is B-stable if, for the differential equations

    $$\frac{dx}{dt} = f(t, x), \qquad \frac{dy}{dt} = f(t, y),$$

the condition

(6.1.44)    $$(f(t, u) - f(t, v))^T (u - v) \le 0 \quad\text{for all $u$, $v$ and $t$}$$

implies that applying the given method results in

    $$\|x_{k+1} - y_{k+1}\|_2 \le \|x_k - y_k\|_2.$$

To see the relevance of this condition, the exact solutions satisfy

    $$\frac{d}{dt}\left(\|x(t) - y(t)\|_2^2\right) = 2\,(f(t, x(t)) - f(t, y(t)))^T (x(t) - y(t)) \le 0$$

so $\|x(t + r) - y(t + r)\|_2 \le \|x(t) - y(t)\|_2$ for any $r > 0$. This may appear to be a
difficult condition to check as it appears to require checking all functions satisfying
(6.1.44). However, both Burrage & Butcher [41] and Crouzeix [63] found a way
of identifying B-stable Runge-Kutta methods: a Runge-Kutta method is B-stable if
and only if $b_i \ge 0$ for all $i$, and

(6.1.45)    $$M := BA + A^T B - b\,b^T \quad\text{where } B = \operatorname{diag}(b_1, \ldots, b_s),$$



is a positive semi-definite matrix. Note that B-stability implies A-stability. The Gauss 
and Radau IIA methods are B-stable methods, but the above DIRK methods are not. 
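
The test (6.1.45) is mechanical for small tableaus. Below is a minimal sketch for 2-stage methods; the function names are hypothetical, and the Butcher tableaus used in the usage note are the standard ones from the literature rather than values given in this section.

```python
import math


def algebraic_stability_matrix(A, b):
    """Return M = B A + A^T B - b b^T with B = diag(b), as in (6.1.45)."""
    s = len(b)
    return [[b[i] * A[i][j] + b[j] * A[j][i] - b[i] * b[j] for j in range(s)]
            for i in range(s)]


def is_b_stable_2stage(A, b, tol=1e-12):
    """Check b_i >= 0 and M positive semi-definite for a 2-stage method."""
    if min(b) < -tol:
        return False
    M = algebraic_stability_matrix(A, b)
    # A symmetric 2x2 matrix is positive semi-definite exactly when
    # its trace and determinant are both non-negative.
    trace = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return trace >= -tol and det >= -tol


SQ3 = math.sqrt(3.0)
GAUSS2_A = [[1 / 4, 1 / 4 - SQ3 / 6], [1 / 4 + SQ3 / 6, 1 / 4]]
GAUSS2_B = [1 / 2, 1 / 2]
RADAU2A_A = [[5 / 12, -1 / 12], [3 / 4, 1 / 4]]
RADAU2A_B = [3 / 4, 1 / 4]
HEUN_A = [[0.0, 0.0], [1.0, 0.0]]
HEUN_B = [1 / 2, 1 / 2]
```

With these tableaus, `is_b_stable_2stage` accepts the Gauss and Radau IIA methods (for Gauss, $M$ is the zero matrix up to roundoff) and rejects the explicit Heun method, whose $M$ has negative diagonal entries.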

There is a problem that arises with Runge-Kutta methods that does not occur with
multistep methods, called order reduction for stiff equations [209]. The order of a
method is achieved for sufficiently small $h$, usually when $Lh \ll 1$ where $L$ is the
Lipschitz constant. For stiff but stable differential equations, we might have $h > 0$
small, but $Lh$ is not. It is possible that in the range $1/L \ll h \ll 1$, the error is not
behaving like $C h^p$ where $p$ is the asymptotic order of the method. To illustrate this
phenomenon, consider the initial value problem

(6.1.46)    $$\frac{dx}{dt} = g'(t) - D\,[x - g(t)], \qquad x(0) = g(0),$$

where $D$ is a diagonal matrix with at least some large diagonal entries, and $g(t)$
a slowly varying function. The solution to (6.1.46) is $x(t) = g(t)$. This is a slowly
varying function, and so we should be able to get accurate solutions even though the
entries in $D$ may be large. Consider the modified stability equation

    $$\frac{dx}{dt} = g'(t) + \lambda\,[x - g(t)] \quad\text{with } \operatorname{Re}\lambda < 0.$$

The exact solution with $x(t_0) = g(t_0)$ is $x(t) = g(t)$ for all $t$.
To analyze the numerical method applied to this equation, we introduce some
extra quantities that connect the method to $g$:

    $$\Delta_{k,j} = g(t_k + c_j h) - g(t_k) - h \sum_{i=1}^{s} a_{ji}\, g'(t_k + c_i h), \qquad j = 1, 2, \ldots, s,$$

    $$\widehat\Delta_k = g(t_k + h) - g(t_k) - h \sum_{j=1}^{s} b_j\, g'(t_k + c_j h).$$




We combine the components $\Delta_{k,j}$ into a vector $\Delta_k$. If $\Delta_k = O(h^{q+1})$
we say that the method has stage order $q$, whereas if $\widehat\Delta_k = O(h^{p+1})$ we say
that the method has quadrature order $p$. Note that these are equivalent to Butcher's
simplifying assumptions $C(q)$ and $B(p)$ respectively. After some calculations, we obtain the
formula

    $$x_{k+1} - g(t_{k+1}) = R(h\lambda)\,[x_k - g(t_k)] - h\lambda\, b^T (I - h\lambda A)^{-1} \Delta_k - \widehat\Delta_k.$$

The $s$-stage Gauss method satisfies the simplifying assumptions $B(2s)$ and $C(s)$, so
that the stage order is $s$ while the quadrature order is $2s$. The Radau IIA methods
satisfy $B(2s - 1)$ and $C(s)$. If $1/|\lambda| \ll h \ll 1$ we have

    $$-h\lambda\, b^T (I - h\lambda A)^{-1} \Delta_k = -b^T \big((h\lambda)^{-1} I - A\big)^{-1} \Delta_k \approx b^T A^{-1} \Delta_k$$

for $A$ invertible. If the method is stiffly accurate in the sense that $b^T = e_s^T A$ (that is,
$b^T$ is the last row of $A$), then for $|h\lambda|$ large we get

    $$-h\lambda\, b^T (I - h\lambda A)^{-1} \Delta_k \approx e_s^T \Delta_k = \Delta_{k,s} = \widehat\Delta_k, \quad\text{so}$$
    $$x_{k+1} - g(t_{k+1}) \approx R(h\lambda)\,[x_k - g(t_k)]$$

to high order. We still need $|R(h\lambda)| \le 1$ for stability, of course. If $0 < h \ll 1$ but
$h\lambda$ is not small, then

    $$x_{k+1} - g(t_{k+1}) = R(h\lambda)\,[x_k - g(t_k)] + \Delta_{k,s} - \frac{1}{h\lambda}\, b^T A^{-1} \Big(\frac{1}{h\lambda} I - A\Big)^{-1} \Delta_k - \widehat\Delta_k$$
    $$\qquad = R(h\lambda)\,[x_k - g(t_k)] - \frac{1}{h\lambda}\, b^T A^{-1} \Big(\frac{1}{h\lambda} I - A\Big)^{-1} \Delta_k$$
    $$\qquad = R(h\lambda)\,[x_k - g(t_k)] + O(h^{q+1} / |h\lambda|).$$

If $R(h\lambda) \ne 1$ for $h\lambda$ not small, even if $|R(h\lambda)| = 1$, then there is no
accumulation of error over many steps: $\Delta_k$ varies only slowly with $k$ so

    $$\left| \sum_{j=0}^{k} R(h\lambda)^{k-j}\, \frac{1}{h\lambda}\, b^T A^{-1} \Big(\frac{1}{h\lambda} I - A\Big)^{-1} \Delta_j \right| = O(h^{q+1} / |h\lambda|),$$

independently of $k$. While this may still be worse than $O(h^p)$, this shows that there
is benefit to be had from using stiffly accurate methods such as the Radau IIA and
Alexander DIRK methods over the Gauss and Crouzeix & Raviart DIRK methods
respectively.
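
The effect can be observed on the scalar test equation $dx/dt = g'(t) + \lambda[x - g(t)]$. The sketch below is an illustration under stated assumptions, not the text's own experiment: it takes $g = \sin$, uses the standard 2-stage Gauss and Radau IIA tableaus from the literature, and the names `solve2` and `final_error` are hypothetical. Because the equation is linear, the stage equations reduce to a 2 x 2 linear system.

```python
import math

SQ3 = math.sqrt(3.0)
# Standard 2-stage tableaus (A, b, c); quoted from the literature (an assumption).
GAUSS2 = ([[1 / 4, 1 / 4 - SQ3 / 6], [1 / 4 + SQ3 / 6, 1 / 4]],
          [1 / 2, 1 / 2], [1 / 2 - SQ3 / 6, 1 / 2 + SQ3 / 6])
RADAU2A = ([[5 / 12, -1 / 12], [3 / 4, 1 / 4]], [3 / 4, 1 / 4], [1 / 3, 1.0])


def solve2(M, r):
    """Solve the 2x2 linear system M y = r by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(r[0] * M[1][1] - M[0][1] * r[1]) / det,
            (M[0][0] * r[1] - r[0] * M[1][0]) / det]


def final_error(tableau, lam, h, n_steps, g=math.sin, dg=math.cos):
    """Integrate x' = g'(t) + lam*(x - g(t)), x(0) = g(0); return |x_n - g(t_n)|."""
    A, b, c = tableau
    t, x = 0.0, g(0.0)
    # Stage matrix I - h*lam*A is the same on every step.
    M = [[(1.0 if i == j else 0.0) - h * lam * A[i][j] for j in range(2)]
         for i in range(2)]
    for _ in range(n_steps):
        # Part of f that does not depend on x: g'(t_i) - lam*g(t_i) at stage times.
        phi = [dg(t + ci * h) - lam * g(t + ci * h) for ci in c]
        rhs = [x + h * sum(A[i][j] * phi[j] for j in range(2)) for i in range(2)]
        X = solve2(M, rhs)  # stage values X_i approximating x(t + c_i h)
        x = x + h * sum(b[j] * (phi[j] + lam * X[j]) for j in range(2))
        t += h
    return abs(x - g(t))
```

Calling `final_error(GAUSS2, -1e6, 0.1, 10)` and `final_error(RADAU2A, -1e6, 0.1, 10)` compares the two methods in the regime $1/|\lambda| \ll h \ll 1$; the stiffly accurate Radau IIA method should give a much smaller error, in line with the $O(h^{q+1}/|h\lambda|)$ estimate.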
6.1.6.2 Stability of Multistep Methods


The stability theory of multistep methods takes a slightly different form because of
the different structure of these methods. We will assume a form that encompasses
both the Adams and BDF methods:

(6.1.47)    $$x_{k+1} = \sum_{j=0}^{p} \alpha_j\, x_{k-j} + h \sum_{j=-1}^{p} \beta_j\, f(t_{k-j}, x_{k-j}).$$

Applied to the differential equation $dx/dt = \lambda x$ we get the linear recurrence

    $$(1 - h\lambda\beta_{-1})\, x_{k+1} = \sum_{j=0}^{p} (\alpha_j + h\lambda\beta_j)\, x_{k-j}.$$


The stability of this recurrence depends on the roots of the characteristic equation

(6.1.48)    $$(1 - h\lambda\beta_{-1})\, r^{p+1} - \sum_{j=0}^{p} (\alpha_j + h\lambda\beta_j)\, r^{p-j} = 0.$$

If we write $a(r) = r^{p+1} - \sum_{j=0}^{p} \alpha_j r^{p-j}$ and
$\beta(r) = \sum_{j=-1}^{p} \beta_j r^{p-j}$ then we can write the characteristic equation
more succinctly as

(6.1.49)    $$a(r) - h\lambda\, \beta(r) = 0.$$

One difference with the case of Runge-Kutta methods is that the case of $\lambda = 0$ can
still be a problem. That is, applying the method to the differential equation
$dx/dt = 0$ can be unstable. Such methods do not converge, as errors are amplified by a
significant amount with each step.

More specifically, setting $\lambda = 0$ gives the characteristic equation $a(r) = 0$. The
fundamental stability conditions that are necessary for convergence of the method
are that

(6.1.50)    every root of $a(r) = 0$ has $|r| \le 1$ and
            every root $r$ with $|r| = 1$ has multiplicity one.

To see why this is necessary, consider the recurrence $x_{k+1} = \sum_{j=0}^{p} \alpha_j x_{k-j}$.
If the distinct roots of $a(r) = 0$ are $r_1, \ldots, r_m$ with multiplicities
$\nu_1, \ldots, \nu_m$, respectively, then the general solution of the recurrence is

(6.1.51)    $$x_k = \sum_{i=1}^{m} q_i(k)\, r_i^k, \qquad \deg q_i \le \nu_i - 1 \text{ for all } i,$$

where each $q_i$ is a polynomial. If $|r_i| > 1$ for any $i$, then solutions can grow
exponentially fast in the number of steps, and not the time $t - t_0$. Even if $|r_i| = 1$, then
the solutions can grow polynomially fast in the number of steps, if the multiplicity
$\nu_i > 1$. This can result in errors growing like $O(1/h^{\nu_i - 1})$ as $h \downarrow 0$ in this case.

The Adams methods (6.1.36) have the form

    $$x_{k+1} = x_k + h \sum_{j=-1}^{p} \beta_j\, f(t_{k-j}, x_{k-j})$$

so for the Adams methods, $a(r) = r^{p+1} - r^p$. The roots are one, which is a simple
root, and zero which has multiplicity $p$. The Adams methods clearly satisfy the basic
stability condition (6.1.50). BDF methods, on the other hand, have the form

    $$x_{k+1} = \sum_{j=0}^{p} \alpha_j\, x_{k-j} + h\, \beta_{-1}\, f(t_{k+1}, x_{k+1}).$$

It is far from being a foregone conclusion that BDF methods satisfy (6.1.50). The
BDF method for $p = 1$ is

    $$x_{k+1} = \frac{4}{3} x_k - \frac{1}{3} x_{k-1} + h\, \frac{2}{3}\, f(t_{k+1}, x_{k+1});$$

for this method $a(r) = r^2 - \frac{4}{3} r + \frac{1}{3}$ which has roots
$r = \big(\frac{4}{3} \pm \sqrt{(\frac{4}{3})^2 - 4 \times 1 \times \frac{1}{3}}\,\big)/2 = \frac{2}{3} \pm \frac{1}{3}$,
of which one root is equal to one and the other is 1/3, and so satisfies the
basic stability condition (6.1.50). Every BDF method for $p = 0, 1, 2, \ldots, 5$ satisfies
(6.1.50), but the BDF method for $p = 6$ fails (6.1.50). Larger values of $p$ also fail
(6.1.50). This is why there are only six BDF methods.
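
The root condition (6.1.50) can be checked numerically. The sketch below makes two assumptions that are not taken from this section: it uses the standard closed form $\sum_{j=1}^{k} \frac{1}{j} r^{k-j}(r-1)^j$ for the $k$-step BDF characteristic polynomial (a scalar multiple of $a(r)$ with $k = p + 1$), and it finds roots with a plain Durand-Kerner iteration, which is adequate only when the roots are simple. All function names are hypothetical.

```python
from math import comb


def bdf_char_poly(k):
    """Coefficients, highest power first, of sum_{j=1}^k (1/j) r^(k-j) (r-1)^j."""
    coef = [0.0] * (k + 1)  # coef[d] multiplies r**d
    for j in range(1, k + 1):
        for i in range(j + 1):
            # binomial expansion of (r - 1)^j, shifted by r^(k-j)
            coef[k - j + i] += comb(j, i) * (-1) ** (j - i) / j
    return coef[::-1]


def poly_roots(coeffs, iters=500):
    """All complex roots by Durand-Kerner iteration (simple roots assumed)."""
    monic = [c / coeffs[0] for c in coeffs]
    n = len(monic) - 1
    roots = [(0.4 + 0.9j) ** i for i in range(n)]
    for _ in range(iters):
        new_roots = []
        for i, r in enumerate(roots):
            p = 0j
            for c in monic:  # Horner evaluation of the polynomial at r
                p = p * r + c
            q = 1 + 0j
            for m, s in enumerate(roots):
                if m != i:
                    q *= r - s
            new_roots.append(r - p / q)
        roots = new_roots
    return roots


def satisfies_root_condition(coeffs, tol=1e-6):
    """Numerical version of (6.1.50): |r| <= 1, and roots with |r| = 1 are simple."""
    roots = poly_roots(coeffs)
    if any(abs(r) > 1 + tol for r in roots):
        return False
    on_circle = [r for r in roots if abs(r) >= 1 - tol]
    return all(abs(r - s) > 1e-4
               for i, r in enumerate(on_circle) for s in on_circle[i + 1:])
```

For $k = 2$ (that is, $p = 1$) the computed roots are 1 and 1/3, matching the hand calculation above. Looping `satisfies_root_condition(bdf_char_poly(k))` over $k$ gives a numerical check of the statement that only the first six BDF methods satisfy (6.1.50).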

There is also a consistency condition that we need: if $dx/dt = 0$ and $x(0) = 1$,
the solution should be constant. If we want this to hold for our method, then we need

(6.1.52)    $$1 = \sum_{j=0}^{p} \alpha_j.$$

To obtain a consistency order of $m \ge 1$ we need

(6.1.53)    $$\sum_{j=0}^{p} \alpha_j (-j)^q + q \sum_{j=-1}^{p} \beta_j (-j)^{q-1} = 1 \quad\text{for } q = 1, 2, \ldots, m.$$

In our derivations of the Adams and BDF methods, we did not need to establish
the consistency conditions (6.1.52, 6.1.53) directly. Rather we used the accuracy
properties of polynomial interpolation to provide the consistency properties.
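
The conditions (6.1.52, 6.1.53) can be verified in exact rational arithmetic. This is a minimal sketch with a hypothetical function name and storage convention: `alpha[j]` holds $\alpha_j$ for $j = 0, \ldots, p$, and `beta[j + 1]` holds $\beta_j$ for $j = -1, \ldots, p$. The term $(-j)^{q-1}$ with $j = 0$ and $q = 1$ is read as 1.

```python
from fractions import Fraction as F


def consistency_order(alpha, beta, max_q=10):
    """Largest m for which (6.1.52) and (6.1.53) hold for q = 1..m.

    Returns 0 if only (6.1.52) holds and -1 if even (6.1.52) fails.
    """
    if sum(alpha) != 1:
        return -1
    m = 0
    for q in range(1, max_q + 1):
        lhs = sum(a * F(-j) ** q for j, a in enumerate(alpha))
        lhs += q * sum(b * F(-j) ** (q - 1)
                       for j, b in enumerate(beta, start=-1))
        if lhs != 1:
            break
        m = q
    return m


# Coefficients for a few familiar methods (standard values).
BDF2 = ([F(4, 3), F(-1, 3)], [F(2, 3), 0, 0])
AB2 = ([1, 0], [0, F(3, 2), F(-1, 2)])
IMPLICIT_EULER = ([1], [1, 0])
TRAPEZOIDAL = ([1], [F(1, 2), F(1, 2)])
```

For example, `consistency_order(*BDF2)` reproduces the second-order accuracy of the BDF method with $p = 1$ written out above.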
The combination of stability (6.1.50) and consistency conditions (6.1.52, 6.1.53)
implies convergence of the method. To see this, we note that the consistency conditions
imply that the local truncation error is

    $$\tau_k = x(t_{k+1}) - \sum_{j=0}^{p} \alpha_j\, x(t_{k-j}) - h \sum_{j=-1}^{p} \beta_j\, x'(t_{k-j}) = O(h^{m+1}),$$

by expanding $x(t + s)$ using Taylor series with remainder of order $m + 1$ in $s$. We
can then write

    $$x(t_{k+1}) = \sum_{j=0}^{p} \alpha_j\, x(t_{k-j}) + h \sum_{j=-1}^{p} \beta_j\, f(t_{k-j}, x(t_{k-j})) + \tau_k,$$
    $$x_{k+1} = \sum_{j=0}^{p} \alpha_j\, x_{k-j} + h \sum_{j=-1}^{p} \beta_j\, f(t_{k-j}, x_{k-j}).$$

Subtracting and setting the error $e_j = x(t_j) - x_j$ we get

    $$e_{k+1} = \sum_{j=0}^{p} \alpha_j\, e_{k-j} + h \sum_{j=-1}^{p} \beta_j\, [f(t_{k-j}, x(t_{k-j})) - f(t_{k-j}, x_{k-j})] + \tau_k.$$

Assuming that $f(t, x)$ is Lipschitz in $x$ with Lipschitz constant $L$, we get

    $$e_{k+1} = \sum_{j=0}^{p} \alpha_j\, e_{k-j} + \eta_k \quad\text{with}\quad \|\eta_k\| \le hL \sum_{j=-1}^{p} |\beta_j|\, \|e_{k-j}\| + O(h^{m+1}).$$

To find the solution we first find the solution to the linear recurrence

    $$\theta_k = \sum_{j=0}^{p} \alpha_j\, \theta_{k-j-1} + \begin{cases} 1, & \text{if } k = 0, \\ 0, & \text{otherwise,} \end{cases} \qquad \theta_k = 0 \text{ for } k < 0.$$

From the basic stability condition (6.1.50), there is a constant $M$ where $|\theta_k| \le M$
for all $k$. We can use a discrete convolution to write $e_k$ in terms of the $\eta_j$ plus a
bounded term that comes from the starting error values $e_0, e_1, \ldots, e_{p-1}$. Assuming
that $e_0 = e_1 = \cdots = e_{p-1} = 0$, so that $e_{k+1} = \sum_{j=p}^{k} \theta_{k-j}\, \eta_j$,
we can then obtain the bounds

    $$\|e_{k+1}\| \le \sum_{j=p}^{k} |\theta_{k-j}|\, \|\eta_j\| \le \sum_{j=p}^{k} M \Big( hL \sum_{\ell=-1}^{p} |\beta_\ell|\, \|e_{j-\ell}\| + O(h^{m+1}) \Big).$$

If $\psi_k = \max_{j \le k} \|e_j\|$ then $\ell \le k$ implies $\psi_\ell \le \psi_k$, so

    $$\psi_{k+1} \le \sum_{j=p}^{k} M \Big( hL \sum_{\ell=-1}^{p} |\beta_\ell|\, \psi_{j+1} + O(h^{m+1}) \Big) \le \sum_{j=p}^{k} M \big( hL\, \|\beta\|_1\, \psi_{j+1} + O(h^{m+1}) \big) \quad\text{and}$$

    $$\psi_{k+1} \le \frac{1}{1 - MLh\, \|\beta\|_1} \sum_{j=p}^{k-1} M \big( hL\, \|\beta\|_1\, \psi_{j+1} + O(h^{m+1}) \big).$$

This gives a bound on $\psi_k$ of the form $\exp(c\,(t_k - t_0))\, O(h^m)$ for a suitable constant $c$.
This implies that $\|e_k\| \le \exp(c\,(t_k - t_0))\, O(h^m)$.

Stability regions for multistep methods, as for Runge-Kutta methods, are defined
as the set of $h\lambda$ values in the complex plane for which the multistep method applied
to $dx/dt = \lambda x$ gives a stable recurrence. This amounts to showing that all roots $r$
of the characteristic equation $a(r) - h\lambda\, \beta(r) = 0$ (6.1.49) lie within the unit circle:
$|r| < 1$. The boundaries of these regions are given by $h\lambda = a(r)/\beta(r)$ for $|r| = 1$.
The stability regions for the Adams methods are shown in Figure 6.1.6. Note that
the lobes of the AB4 method protruding into the right-half plane should be excluded
from the stability region; otherwise the interior of the curves shown is the stability
region. The stability regions for the BDF methods are shown in Figure 6.1.7; the
regions to the left of, and outside, the curves shown are the stability regions.
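
The boundary curve $h\lambda = a(r)/\beta(r)$, $|r| = 1$, is simple to trace. The sketch below relies on an assumption not stated in this section: for the $k$-step BDF method the ratio has the standard closed form $\sum_{j=1}^{k} \frac{1}{j}(1 - 1/r)^j$. The function names are hypothetical.

```python
import cmath
import math


def bdf_boundary_locus(k, n=720):
    """Points z = a(r)/beta(r) for r on the unit circle, k-step BDF method."""
    points = []
    for i in range(n):
        r = cmath.exp(2j * math.pi * i / n)
        points.append(sum((1 - 1 / r) ** j / j for j in range(1, k + 1)))
    return points


def min_real_part(k):
    """Smallest real part on the boundary curve; negative means the curve
    enters the left half-plane, so the method cannot be A-stable."""
    return min(z.real for z in bdf_boundary_locus(k))
```

For $k = 1$ and $k = 2$ the curve stays in the closed right half-plane, while for $k = 3$ it dips slightly into the left half-plane (the minimum real part is $-1/12$, attained at angle 60 degrees), which is the small region near the imaginary axis that the third-order BDF method misses.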

It should be noted that of these multistep methods, only implicit Euler (AM1 
and BDF1), the implicit trapezoidal rule (AM2), and the second-order BDF method 
(BDF2) are A-stable; that is their stability region includes the entire left-half complex 
plane. The third-order BDF method is close to being A-stable, but its stability region 
misses a small part close to the imaginary axis. No multistep method with order 
higher than two is A-stable, as was shown by Dahlquist [67]; this is an example 
of an order barrier. The excellent stability properties of the BDF methods make
them useful methods for many applications, such as differential algebraic equations
(DAEs) as well as partial differential equations in time.

Fig. 6.1.6 Stability regions for Adams methods: (a) Adams-Bashforth methods; (b) Adams-Moulton methods

Fig. 6.1.7 Stability regions for BDF methods


6.1.7 Practical Aspects of Implicit Methods 


Implicit methods generally require solving a system of equations, and these equations
are usually non-linear. As we have seen in Chapter 3, there are a number of
ways of solving systems of non-linear equations, such as fixed-point iteration and
variants of Newton's method. Using Newton's method involves solving systems of
linear equations, and there is much about how to do this in Chapter 2, whether using
direct or iterative methods. Knowing something of the structure of the equations
can help design good methods. Ultimately, the methods chosen will depend on the
system of differential equations being solved, and so the designer of the software for
implicit methods needs to allow the user some flexibility as to how to do this.

Most of the implicit methods we have considered lead to the problem of finding
$x$ satisfying

(6.1.54)    $$x = u + h\beta\, f(t, w + \alpha x)$$

given $u$, $w$, $\alpha$, $\beta$, and of course $f$. Exceptions to this rule are the Runge-Kutta
methods where the equations (6.1.20) are fully implicit. DIRK methods (where the $A$
matrix of the Butcher tableau is lower triangular) can be decomposed into $s$ equations
like (6.1.54) to be solved sequentially. Even fully implicit Runge-Kutta methods can
be streamlined to avoid the full complexity of the general Runge-Kutta system.
Fixed-point iteration can be applied to (6.1.54):

(6.1.55)    $$x^{(m+1)} = u + h\beta\, f(t, w + \alpha x^{(m)}), \qquad m = 0, 1, 2, \ldots$$

This will converge provided $h\,|\alpha\beta|\,L < 1$ where $L$ is the Lipschitz constant of $f$.
Unfortunately, this strategy will likely fail when applied to stiff differential equations
with $L$ large and $h$ small, but $Lh > 1/|\alpha\beta|$. Often in practical situations we can
write $f(t, z) = Az + g(t, z)$ where $A$ is a large matrix both in terms of its norm
and in terms of the number of rows and columns. This often arises in dealing with
partial differential equations, for example. Suppose that $L_g$ is the Lipschitz constant
for $g$ and $L_g \ll \|A\|$. Provided the large eigenvalues of $A$ have negative real part, we
can use this together with some suitable linear solvers to obtain efficient methods:
(6.1.54) is then equivalent to

    $$x = u + h\beta\, [A(w + \alpha x) + g(t, w + \alpha x)]$$

and we can re-write the system as

    $$(I - h\beta\alpha A)\, x = u + h\beta\, [Aw + g(t, w + \alpha x)]$$

for which we have a modified fixed-point iteration

    $$x^{(m+1)} = (I - h\beta\alpha A)^{-1} \{ u + h\beta\, [Aw + g(t, w + \alpha x^{(m)})] \}$$

which converges provided $\|(I - h\beta\alpha A)^{-1}\|\, h\,|\alpha\beta|\, L_g < 1$. We do not expect
$\|(I - h\beta\alpha A)^{-1}\|$ to be small, but we can reasonably hope for it to be modest. For
example, if $-A$ is positive definite, then $\|(I - h\beta\alpha A)^{-1}\|_2 \le 1$ provided $\alpha\beta > 0$.
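
The difference between the two iterations shows up even in the scalar case. The following sketch is illustrative only: it uses a hypothetical autonomous right-hand side $f(z) = az + g(z)$ with $a = -1000$ and $g = \sin$ (so $L_g = 1$), and the function names are not from the text.

```python
import math


def plain_iteration(u, w, alpha, hbeta, f, x0, n_iter):
    """Fixed-point iteration (6.1.55) for x = u + h*beta*f(w + alpha*x)."""
    x = x0
    for _ in range(n_iter):
        x = u + hbeta * f(w + alpha * x)
    return x


def split_iteration(u, w, alpha, hbeta, a, g, x0, n_iter):
    """Modified iteration for f(z) = a*z + g(z), scalar case:
    x <- (1 - h*beta*alpha*a)^(-1) * (u + h*beta*(a*w + g(w + alpha*x)))."""
    x = x0
    for _ in range(n_iter):
        x = (u + hbeta * (a * w + g(w + alpha * x))) / (1 - hbeta * alpha * a)
    return x


def residual(x, u, w, alpha, hbeta, f):
    """Residual of the implicit equation (6.1.54) at x."""
    return x - u - hbeta * f(w + alpha * x)
```

With `a = -1000`, `hbeta = 0.1`, and `alpha = 1`, the plain iteration has contraction factor about 100 and diverges, while the modified iteration has factor about $0.1/101$ and converges in a handful of steps.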

Flexibility from the software may be desired to allow the user to provide efficient
means of solving the equation $(I - h\beta\alpha A)x = y$ for $x$ given $y$. The user can
provide or designate specific iterative methods and preconditioners tailored for the
user's problems. Exactly how this should be implemented depends on the programming
language and software standards in use.

One approach is for the user to pass an nlsolver function to the odesolver function
for solving the differential equation. The nlsolver function takes as input the data for
the problem, plus whatever other data the user can provide to help solve the equations
efficiently:

    function nlsolver(u, w, α, hβ, f, p, ..., ε)
        ...
        return x
    end

The p, ... inputs can be parameters, functions, links to preconditioners etc. The ODE
solver then has the form

    function odesolver(f, x0, error tolerances etc., nlsolver, p, ...)
        ...
        // now to solve x = u + hβ f(t, w + αx)
        ε ← ...    // accuracy needed for solving equations
        x ← nlsolver(u, w, α, hβ, f, p, ..., ε)
        ...
    end

In this way, odesolver does not have to worry at all about how the solution is found,
only the accuracy needed for it. If $\epsilon_0$ is the overall accuracy desired, then we should
have $\epsilon \approx \epsilon_0 h$. If the solver fails, there should be a mechanism for passing the fact of
the failure, plus useful information about the cause of the failure, back to the end-user.
Many programming languages have error/exception mechanisms for doing this
without requiring cumbersome error code returns that have been the mainstay of
scientific computing in Fortran, for example.
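
One way this design could look in practice is sketched below for a scalar ODE with implicit Euler as the underlying method. Everything here is an assumption for illustration: the names `NonlinearSolveError`, `fixed_point_nlsolver`, and `implicit_euler`, the extra argument `t`, and the choice of plain fixed-point iteration as the user-supplied solver.

```python
class NonlinearSolveError(Exception):
    """Raised when the user-supplied solver cannot reach the requested accuracy."""


def fixed_point_nlsolver(u, w, alpha, hbeta, f, t, eps, max_iter=50):
    """One possible nlsolver: fixed-point iteration for x = u + h*beta*f(t, w + alpha*x)."""
    x = u
    for _ in range(max_iter):
        x_new = u + hbeta * f(t, w + alpha * x)
        if abs(x_new - x) <= eps:
            return x_new
        x = x_new
    # Report the failure and its context instead of returning an error code.
    raise NonlinearSolveError(
        f"no convergence within {max_iter} iterations at t={t}, h*beta={hbeta}")


def implicit_euler(f, x0, t0, t_end, n_steps, nlsolver, eps0=1e-8):
    """Implicit Euler for a scalar ODE; the implicit equation is delegated to nlsolver."""
    h = (t_end - t0) / n_steps
    t, x = t0, x0
    for _ in range(n_steps):
        # x_new = x + h*f(t + h, x_new): u = x, w = 0, alpha = 1, beta = 1.
        # Accuracy requested from the solver scales like eps0 * h.
        x = nlsolver(x, 0.0, 1.0, h, f, t + h, eps0 * h)
        t += h
    return x
```

The ODE driver never inspects how the equation is solved; a stiff problem for which fixed-point iteration diverges surfaces as a `NonlinearSolveError` that the caller can catch, after which a different `nlsolver` (for example a Newton-based one) can be passed in.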

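Fully implicit Runge-Kutta methods pose a slightly different challenge. The
Runge-Kutta equations (6.1.20)

    $$v_{k,j} = f\Big(t_k + c_j h,\; x_k + h \sum_{i=1}^{s} a_{ji}\, v_{k,i}\Big), \qquad j = 1, 2, \ldots, s,$$

can be solved by Newton's method and its variants. To do this involves computing
the updates $w_{k,j}$ for $v_{k,j}$, $j = 1, 2, \ldots, s$, by means of the linear systems

    $$w_{k,j} - \nabla_x f\Big(t_k + c_j h,\; x_k + h \sum_{i=1}^{s} a_{ji}\, v_{k,i}\Big)\, h \sum_{i=1}^{s} a_{ji}\, w_{k,i}
      = -\Big[ v_{k,j} - f\Big(t_k + c_j h,\; x_k + h \sum_{i=1}^{s} a_{ji}\, v_{k,i}\Big) \Big], \qquad j = 1, 2, \ldots, s.$$

If we let $J_j = \nabla_x f(t_k + c_j h,\, x_k + h \sum_{i=1}^{s} a_{ji} v_{k,i})$ be the Jacobian
matrix of $f$ for stage $j$ of the method, and write $\delta_{k,j}$ for the right-hand side above,
then we can write the linear system more succinctly as

    $$w_{k,j} - h \sum_{i=1}^{s} a_{ji}\, J_j\, w_{k,i} = \delta_{k,j}, \qquad j = 1, 2, \ldots, s.$$

If each $J_j$ is $n \times n$ this is an $ns \times ns$ linear system. Using a direct solver, like LU
or Cholesky or QR factorization, will typically result in $O(n^3 s^3)$ operations, which
can be rather expensive. (A three-stage method would result in the time to solve
increased by a factor of about $3^3 = 27$.) This linear system can be written out in a
more extensive form:

    $$\begin{bmatrix}
      I - h a_{11} J_1 & -h a_{12} J_1 & \cdots & -h a_{1s} J_1 \\
      -h a_{21} J_2 & I - h a_{22} J_2 & \cdots & -h a_{2s} J_2 \\
      \vdots & \vdots & \ddots & \vdots \\
      -h a_{s1} J_s & -h a_{s2} J_s & \cdots & I - h a_{ss} J_s
    \end{bmatrix}
    \begin{bmatrix} w_{k,1} \\ w_{k,2} \\ \vdots \\ w_{k,s} \end{bmatrix}
    = \begin{bmatrix} \delta_{k,1} \\ \delta_{k,2} \\ \vdots \\ \delta_{k,s} \end{bmatrix}.$$

While there are few evident shortcuts for solving such a linear system, at least the
structure is clearer. We can, however, look for good approximations for which we can
solve the system quickly. If $J_j \approx J_0 \approx \nabla_x f(t_k, x_k)$, then we have the approximate
linear system

(6.1.56)    $$\begin{bmatrix}
      I - h a_{11} J_0 & -h a_{12} J_0 & \cdots & -h a_{1s} J_0 \\
      -h a_{21} J_0 & I - h a_{22} J_0 & \cdots & -h a_{2s} J_0 \\
      \vdots & \vdots & \ddots & \vdots \\
      -h a_{s1} J_0 & -h a_{s2} J_0 & \cdots & I - h a_{ss} J_0
    \end{bmatrix}
    \begin{bmatrix} w_{k,1} \\ w_{k,2} \\ \vdots \\ w_{k,s} \end{bmatrix}
    = \begin{bmatrix} \delta_{k,1} \\ \delta_{k,2} \\ \vdots \\ \delta_{k,s} \end{bmatrix}.$$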


This can actually be solved much faster using Kronecker or tensor product methods:
for $A$ ($r \times s$) and $B$ ($m \times n$), $A \otimes B$ is the $rm \times sn$ matrix

(6.1.57)    $$A \otimes B = \begin{bmatrix}
      a_{11} B & a_{12} B & \cdots & a_{1s} B \\
      a_{21} B & a_{22} B & \cdots & a_{2s} B \\
      \vdots & \vdots & \ddots & \vdots \\
      a_{r1} B & a_{r2} B & \cdots & a_{rs} B
    \end{bmatrix}.$$

Tensor products have the following properties:

• $\alpha (A \otimes B) = (\alpha A) \otimes B = A \otimes (\alpha B)$ for any scalar $\alpha$;
• $A \otimes B + A \otimes C = A \otimes (B + C)$ and $A \otimes C + B \otimes C = (A + B) \otimes C$;
• $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$, so $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$;
• $(A \otimes B)^T = A^T \otimes B^T$;
• if $A$ and $B$ are upper triangular, then so is $A \otimes B$.

The eigenvalues of $A \otimes B$ are the products $\lambda\mu$ where $\lambda$ is an eigenvalue of $A$ and
$\mu$ is an eigenvalue of $B$. A quick way to see this is to note that if $\overline{U}^T A U = S$ and
$\overline{V}^T B V = T$ are the Schur decompositions (2.5.7) of $A$ and $B$ respectively, then

    $$(\overline{U \otimes V})^T (A \otimes B)(U \otimes V) = S \otimes T \quad\text{upper triangular.}$$

Since $U$ and $V$ are unitary, $U^{-1} = \overline{U}^T$ and $V^{-1} = \overline{V}^T$ so
$(U \otimes V)^{-1} = U^{-1} \otimes V^{-1} = \overline{U}^T \otimes \overline{V}^T = (\overline{U \otimes V})^T$.
As $S \otimes T$ is upper triangular, its eigenvalues (which
are the eigenvalues of $A \otimes B$) are the diagonal entries of $S \otimes T$, which are products
of diagonal entries of $S$ and diagonal entries of $T$. That is, each eigenvalue of $A \otimes B$
is the product of an eigenvalue of $A$ and an eigenvalue of $B$. Furthermore, every
product of an eigenvalue of $A$ and an eigenvalue of $B$ is an eigenvalue of $A \otimes B$.
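
The definition (6.1.57) and the mixed-product property are easy to check on small integer matrices. This is a minimal sketch with hypothetical helper names, storing matrices as lists of rows.

```python
def kron(A, B):
    """Kronecker product A (x) B following (6.1.57); matrices are lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def matmul(A, B):
    """Ordinary matrix product of two lists-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def diagonal(M):
    """Diagonal entries of a square matrix."""
    return [M[i][i] for i in range(len(M))]
```

For upper triangular `S = [[2, 1], [0, 3]]` and `T = [[5, 7], [0, 11]]`, `kron(S, T)` is again upper triangular and its diagonal lists every product of a diagonal entry of `S` with one of `T`, which is the eigenvalue statement above in its simplest form.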

The system of equations (6.1.56) can be represented using Kronecker or tensor
products:

    $$(I \otimes I - h A \otimes J_0)\, w = \delta.$$

If we have a Schur decomposition $\overline{U}^T A U = S$ of $A$, then

    $$(\overline{U}^T \otimes I)(I \otimes I - h A \otimes J_0)(U \otimes I)\,(\overline{U}^T \otimes I)\, w = (\overline{U}^T \otimes I)\, \delta; \quad\text{that is,}$$
    $$(I \otimes I - h S \otimes J_0)(\overline{U}^T \otimes I)\, w = (\overline{U}^T \otimes I)\, \delta.$$

Note that since $S$ is upper triangular, $S \otimes J_0$ is block upper triangular, and each block
is a scalar multiple of $J_0$. Block backward substitution can then be used for these
equations; the crucial step is solving $(I - h s_{kk} J_0) z = b$ for each $k$. For many cases,
this approach is problematic as it involves doing complex arithmetic and complex
function evaluations. It is probably better to use the real Schur decomposition, which
has a $2 \times 2$ diagonal block in $S$ for each complex conjugate pair of eigenvalues. This means
that a linear system of the form

    $$\begin{bmatrix} I - h\alpha J_0 & +h\beta J_0 \\ -h\beta J_0 & I - h\alpha J_0 \end{bmatrix}
      \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}
      \qquad\text{or}\qquad (I - h\alpha J_0)\, z = b$$

needs to be solved at each stage of using block backward substitution. This means
the total time for solving the linearized Runge-Kutta equations can be reduced
to $O(n^3 s + n^2 s^2)$. In fact, if $A$ is diagonalizable then the cost can be reduced to
$O(n^3 s)$. For example, the $A$ matrix of the 3-stage Gauss method, which has order 6, has
eigenvalues $\approx 0.14234 \pm 0.13580\,i$ and $\approx 0.21531$. The $A$ matrix of the three-stage
Radau IIA method, which has order 5 and is stiffly accurate, has eigenvalues
$\approx 0.16256 \pm 0.18495\,i$ and $\approx 0.27489$. The $A$ matrix in both cases is diagonalizable,
and there is a real invertible matrix $X$ so that $X^{-1} A X$ is real block diagonal with a
$2 \times 2$ block and a $1 \times 1$ block.
So instead of solving a $3n \times 3n$ linear system at each step of Newton's method, we
can solve an $n \times n$ and a $2n \times 2n$ system, which can give a significant improvement
in the computational cost.
6.1.8 Error Estimates and Adaptive Methods


In many differential equations, there are periods of relatively little activity, followed
by short bursts where things change rapidly. Take, for example, the Kepler problem:

(6.1.58)    $$m \frac{d^2 x}{dt^2} = -GMm\, \frac{x}{\|x\|_2^3}.$$


This describes the motion of a planet around a relatively large star, like the Earth 
around the Sun, ignoring the effects of the other planets and celestial objects. It also 
describes the motion of asteroids. As is well known, the solution to (6.1.58) forms 
elliptic orbits with the Sun (at x = 0) at one of the foci of the ellipse. Asteroids can 
have orbits with high eccentricity, so that while they linger for long periods far from 
the Sun, they fall toward it, and pass by close to the Sun where they have high velocity 
and high acceleration that changes direction quite rapidly. Solving such differential 
equations poses some challenges that we have not dealt with explicitly so far. 

Our asymptotic convergence analysis indicates that (except for when the body
impacts the Sun) as long as the step size is "small enough" then the error is also
small, and we have an asymptotic estimate for how large that error would be. The
standard fourth order Runge-Kutta method has a global error of $O(h^4)$ as $h \to 0$.
However, this avoids the question of how small is "small enough". This is clearly
problem dependent, but even with $m = 1$ and $GMm = 1$ in (6.1.58), for the initial
conditions $x_0 = [1, 0]^T$ and $dx/dt(0) = [0, 1/10]^T$, we find that even $h = 10^{-4}$
for the standard fourth-order Runge-Kutta method is not sufficient to make the error
invisible to the eye. Figure 6.1.8 shows the results for exactly this case. The plot on
the left shows the orbits with the Sun ($x = 0$) denoted by a red asterisk. The upper
right of Figure 6.1.8 shows the size of the acceleration vector $a(t) = x''(t)$, while the
lower right shows the energy function $E(t) = \frac{1}{2} m \|v(t)\|_2^2 - GMm / \|x(t)\|_2$ where
$v(t) = x'(t)$ is the velocity vector. Exact solutions of the Kepler problem conserve
energy; that is, $E(t) = E(0)$ and is constant. However, the errors incurred in the
numerical method by the close transit of the orbit near the Sun are large enough to
produce clearly visible changes in the energy.
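
A small experiment along these lines can be set up as follows. This is a sketch, not the code used to produce Figure 6.1.8: it takes $m = 1$ and $GM = 1$, implements the classical fourth-order Runge-Kutta method directly, and all function names are hypothetical. The state is the tuple $(x_1, x_2, v_1, v_2)$.

```python
import math


def kepler_rhs(state, gm=1.0):
    """Right-hand side of the Kepler problem (6.1.58) as a first-order system."""
    x, y, u, v = state
    r = math.hypot(x, y)
    r3 = r * r * r
    return (u, v, -gm * x / r3, -gm * y / r3)


def rk4_step(f, s, h):
    """One step of the classical fourth-order Runge-Kutta method (autonomous f)."""
    k1 = f(s)
    k2 = f(tuple(si + 0.5 * h * ki for si, ki in zip(s, k1)))
    k3 = f(tuple(si + 0.5 * h * ki for si, ki in zip(s, k2)))
    k4 = f(tuple(si + h * ki for si, ki in zip(s, k3)))
    return tuple(si + h * (a + 2 * b + 2 * c + d) / 6
                 for si, a, b, c, d in zip(s, k1, k2, k3, k4))


def energy(state, m=1.0, gm=1.0):
    """E = (1/2) m |v|^2 - G M m / |x|."""
    x, y, u, v = state
    return 0.5 * m * (u * u + v * v) - gm * m / math.hypot(x, y)


def max_energy_drift(h, t_end, state=(1.0, 0.0, 0.0, 0.1)):
    """Largest |E(t_k) - E(0)| seen with fixed step size h up to time t_end."""
    e0 = energy(state)
    drift = 0.0
    for _ in range(round(t_end / h)):
        state = rk4_step(kepler_rhs, state, h)
        drift = max(drift, abs(energy(state) - e0))
    return drift
```

Comparing `max_energy_drift(h, t_end)` for a `t_end` before the close approach with one that extends past it shows where the energy error is generated: almost all of it arises during the short interval near the Sun.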

The obvious solution is to use smaller step sizes. But this is a wasteful use of
computational resources, since for most steps the acceleration is relatively modest, and
large steps can be easily accommodated. So we want large time steps when we can
use them, and short time steps when we must.

Adaptive step sizes are built into most modern ODE solvers. To implement them,
we need to be able to estimate the size of the errors, and then change the step size to
achieve a given error target. But we should be clear about what an adaptive method
can and cannot do. The exponential growth of perturbations in persistently unstable
differential equations like the Lorenz equations (6.1.13-6.1.15) (see Figure 6.1.2)
means that to accurately predict errors, we would need to know how much errors
are amplified before we have solved the equations. Instead, adaptive methods use
local information from one or two steps to estimate the local truncation error (LTE),
and adjust the step size $h$ so that the LTE per unit step is less than the user-specified
target.

Fig. 6.1.8 Results for Kepler problem with $x_0 = [1, 0]^T$, $v_0 = [0, 1/10]^T$ with fixed step size
$h = 10^{-4}$ and the 4th order Runge-Kutta method

We need several new components in a successful adaptive method: a way of
estimating the LTE, a way of determining the new step size, and a way to change the
step size. For Runge-Kutta methods, changing the step size is trivial. For multistep
methods, it is more complex. In either case, Algorithm 63 shows an outline of how
the control mechanism can work. In Algorithm 63, $p$ is the order of the underlying
ODE method. We keep $h_{\min} \le h \le h_{\max}$ for all $h$, and aim to achieve a target LTE per unit
step $\le \epsilon$. The parameter $0 < \gamma < 1$ is used to avoid excessive changes in the step
size $h$, and to take into account the fact that the computed LTE estimate $\tau$ is only
an estimate. If the $h$ update formula in line 16 was $h \leftarrow [\epsilon/(\tau/h)]^{1/p}\, h$ then the
predicted LTE per unit step is $\epsilon$. However, any variation on this could result in the
LTE per unit step estimate again exceeding $\epsilon$ on the next pass through the while
loop. This can lead to an infinite loop where $h$ is almost small enough to get $\tau/h \le \epsilon$.
Instead we have a "fudge factor" $0 < \gamma < 1$ to ensure that this infinite loop does not
occur. While frequent changes in step size can be easily accommodated with Runge-Kutta
methods (line 18 is empty for Runge-Kutta methods), it may be damaging to
multistep methods, as we will see.

Algorithm 63 Adaptive ODE method

 1  function adaptiveodesolver(f, x0, t0, tend, h0, hmin, hmax, ε, γ)
 2    t ← t0; x ← x0; done ← t ≥ tend; h ← h0; k ← 0
 3    while not done
 4      if t + h ≥ tend
 5        h ← tend − t; done ← true
 6      end if
 7      accepted ← false
 8      compute solution estimate x̂
 9      compute LTE estimate τ
10      if τ/h ≤ ε or h = hmin
11        accepted ← true; x ← x̂; t ← t + h
12      else
13        done ← false
14      end if
15      if τ/h < γε or τ/h > ε
16        h ← [γε/(τ/h)]^(1/p) h
17        h ← min(max(h, hmin), hmax)
18        change step size to h
19      end if
20      if accepted
21        k ← k + 1; x_k ← x; t_k ← t
22      end if
23    end while
24    return (x_0, x_1, ..., x_k), (t_0, t_1, ..., t_k)
25  end function


We now need to see how to estimate the local truncation error (LTE). For all
methods of order $p$, one approach is to compute the results for one step of the
method, and also for two steps of the method with half the step size. The difference in the
results divided by $(1 - 2^{-p})$ is then an estimate of the LTE for the larger step size.
However, this more than doubles the cost of each step, and for implicit methods, this can be
very substantial.
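
The step-halving estimate can be sketched for a scalar ODE as follows. The classical fourth-order Runge-Kutta method is used here only as a convenient example of an order $p = 4$ method, and the function names are hypothetical.

```python
import math


def rk4_step(f, t, x, h):
    """One step of the classical fourth-order Runge-Kutta method for x' = f(t, x)."""
    k1 = f(t, x)
    k2 = f(t + h / 2, x + h / 2 * k1)
    k3 = f(t + h / 2, x + h / 2 * k2)
    k4 = f(t + h, x + h * k3)
    return x + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6


def step_halving_lte(f, t, x, h, p=4):
    """Estimate the LTE of one step of size h by comparing with two half steps.

    Returns (estimate, result of the two half steps).
    """
    big = rk4_step(f, t, x, h)
    half = rk4_step(f, t, x, h / 2)
    two_half = rk4_step(f, t + h / 2, half, h / 2)
    return (big - two_half) / (1 - 2.0 ** (-p)), two_half
```

For $dx/dt = x$ with $x(0) = 1$ and $h = 0.1$, the estimate can be compared with the true one-step error `rk4_step(f, 0.0, 1.0, 0.1) - math.exp(0.1)`; the two agree to within a fraction of a percent.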


For Runge-Kutta methods, a more popular approach is to have a pair of overlapping
methods, one of order $p$ and the other of order $p + 1$. An example is the
Runge-Kutta-Fehlberg method of orders four and five [89], which is represented by the
Butcher tableau

(6.1.59)
    0      |
    1/4    |  1/4
    3/8    |  3/32        9/32
    12/13  |  1932/2197   −7200/2197   7296/2197
    1      |  439/216     −8           3680/513     −845/4104
    1/2    |  −8/27       2            −3544/2565   1859/4104     −11/40
    -------+---------------------------------------------------------------------
    b      |  25/216      0            1408/2565    2197/4104     −1/5     0
    b̂      |  16/135      0            6656/12825   28561/56430   −9/50    2/55

The fourth order and fifth order methods are

    $$x_{k+1} = x_k + h \sum_{j=1}^{6} b_j\, v_j, \qquad \widehat{x}_{k+1} = x_k + h \sum_{j=1}^{6} \widehat{b}_j\, v_j, \quad\text{respectively.}$$

To estimate the LTE we compute the difference $T_k = \widehat{x}_{k+1} - x_{k+1}$ with $\widehat{x}_k = x_k$
and set $\tau = \|T_k\|$ for use in Algorithm 63. Note that $T_k = h \sum_{j=1}^{6} (\widehat{b}_j - b_j)\, v_j$, so

    $$\tau / h = \Big\| \sum_{j=1}^{6} (\widehat{b}_j - b_j)\, v_j \Big\|.$$

This value $\tau$ is an asymptotic estimate of the LTE for the fourth order method.
Yet, often in practice, the fifth order method is the one that is actually used to update
$x_{k+1}$. It is perhaps strange to use the method for which the error estimate does not
actually apply, but usually the LTE estimate for the fourth order method is much
larger than the actual LTE for the fifth order method.
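
Putting the embedded pair together with the step-size control of Algorithm 63 gives a compact adaptive solver. The sketch below is for scalar ODEs, advances with the fourth order result (the one the estimate applies to), and follows the outline of Algorithm 63 only loosely; the function names and the guard for $\tau = 0$ are assumptions made for illustration.

```python
# Coefficients of the Runge-Kutta-Fehlberg pair, copied from tableau (6.1.59).
C = (0, 1 / 4, 3 / 8, 12 / 13, 1, 1 / 2)
A = ((),
     (1 / 4,),
     (3 / 32, 9 / 32),
     (1932 / 2197, -7200 / 2197, 7296 / 2197),
     (439 / 216, -8, 3680 / 513, -845 / 4104),
     (-8 / 27, 2, -3544 / 2565, 1859 / 4104, -11 / 40))
B4 = (25 / 216, 0, 1408 / 2565, 2197 / 4104, -1 / 5, 0)          # fourth order
B5 = (16 / 135, 0, 6656 / 12825, 28561 / 56430, -9 / 50, 2 / 55)  # fifth order


def rkf45_step(f, t, x, h):
    """Return (fourth order result, tau) where tau = |fifth - fourth order result|."""
    v = []
    for c, row in zip(C, A):
        v.append(f(t + c * h, x + h * sum(a * vi for a, vi in zip(row, v))))
    x4 = x + h * sum(b * vi for b, vi in zip(B4, v))
    x5 = x + h * sum(b * vi for b, vi in zip(B5, v))
    return x4, abs(x5 - x4)


def adaptive_solve(f, x0, t0, t_end, h0, h_min, h_max, eps, gamma=0.5, p=4):
    """Adaptive integration keeping the estimated LTE per unit step below eps."""
    t, x, h = t0, x0, h0
    ts, xs = [t0], [x0]
    while t < t_end:
        last = t + h >= t_end
        if last:
            h = t_end - t          # do not step past the end point
        x_new, tau = rkf45_step(f, t, x, h)
        ratio = tau / h            # estimated LTE per unit step
        if ratio <= eps or h <= h_min:
            t = t_end if last else t + h
            x = x_new
            ts.append(t)
            xs.append(x)
        if ratio < gamma * eps or ratio > eps:
            # update as in line 16 of Algorithm 63; use h_max if tau vanished
            h = h_max if ratio == 0.0 else h * (gamma * eps / ratio) ** (1.0 / p)
            h = min(max(h, h_min), h_max)
    return ts, xs
```

A call such as `adaptive_solve(lambda t, x: -x, 1.0, 0.0, 1.0, 0.1, 1e-6, 0.5, 1e-8)` returns the accepted times and values; rejected steps leave `t` unchanged and only shrink `h`, so the returned times are strictly increasing.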

Multistep methods can also be used to efficiently obtain LTE estimates, by re- 
using information generated by the method itself. One approach is to use, say, the 
Adams-Bashforth method of order p + | to estimate the LTE for an order p method 
of either the Adams—Moulton or BDF methods. The algorithm DIFSUB of Gear [100] 
uses a more sophisticated approach. The DIFSUB algorithm is also a variable order 
method. Variable order methods avoid the start-up problems for multistep methods 
as they start with order one methods: Euler’s method or the implicit Euler method, 
which requires no $x_{k-j}$ values for $j \ge 1$. Once the step size is small enough for
Euler’s method for a few steps, then the order used can be increased to two, allowing 
the step size to increase. In a few more steps, the method can be increased in order to 
third order, with a larger step size. This increase in order and step size can continue 
until the maximum order is reached, or increasing the order gives no increase in the 
step size. 

Changing the step size for multistep methods is not a cost-free task. If the step
size $h$ is changed, then the statement that $x_{k-j} \approx x(t_{k-j}) = x(t_k - jh)$ becomes
invalid. Instead, we use polynomial interpolation on the computed $x_{k-j}$ to interpolate
approximate values $x^+_{k-j} \approx x(t_k - jh^+)$ for the new step size $h^+$. This process
introduces errors in the new values $x^+_{k-j}$. Indeed, repeatedly changing the step size
can result in substantially larger errors than predicted by the LTE estimates. This
means that it is important to choose the value of $\gamma$ in Algorithm 63 with care.

More advanced variable step size and variable order/step size methods have been 
developed, such as the Sundials suite [127], VODE, and CVODE (based on a previ- 
ous code LSODE). The MATLAB differential equations suite, written by Shampine 
[232], is also an example of an excellent collection of numerical ordinary differential 
equation solvers with step-size control. 
6.1.9 Differential Algebraic Equations (DAEs)


Differential algebraic equations (DAEs) [10, 30] are a combination of a differential
equation and a system of "algebraic" equations:

(6.1.60)    $$\frac{dx}{dt} = f(t, x, y), \qquad x(t_0) = x_0 \in \mathbb{R}^n,$$

(6.1.61)    $$0 = g(t, x, y), \qquad y(t_0) = y_0 \in \mathbb{R}^m,$$

with the condition that $g(t_0, x_0, y_0) = 0$. Often DAEs can be turned into differential
equations, most commonly by solving (6.1.61) for $y$ to give $y = h(t, x)$ so we can
set

    $$\frac{dx}{dt} = f(t, x, h(t, x)), \qquad x(t_0) = x_0.$$

In many cases, the equation $g(t, x, y) = 0$ cannot be solved in this way, and even if
we could (locally), we still need numerical methods to solve the system of equations.

One way of doing this is to approximate the system (6.1.60, 6.1.61) by the system
of differential equations

(6.1.62)    $$\frac{dx_\epsilon}{dt} = f(t, x_\epsilon, y_\epsilon), \qquad x_\epsilon(t_0) = x_0 \in \mathbb{R}^n,$$

(6.1.63)    $$\epsilon\, \frac{dy_\epsilon}{dt} = B\, g(t, x_\epsilon, y_\epsilon), \qquad y_\epsilon(t_0) = y_0 \in \mathbb{R}^m,$$

and take $\epsilon \downarrow 0$. The matrix $B$ is chosen to be invertible and to make (6.1.63) a stable
differential equation so that the equilibria are the solutions of $g(t, x, y) = 0$ for given
$x$.

Since we are interested in $\epsilon \downarrow 0$, the effective stiffness of the DAE (6.1.60, 6.1.61)
is infinite. We therefore seek methods designed for stiff ODEs to use here. These
methods can be either implicit Runge-Kutta methods or BDF multistep methods. Of
the Runge-Kutta methods, we should restrict attention to stiffly accurate methods:
$b^T = e_s^T A$ in the Butcher tableau.

A good example of how we use DAEs comes from the equations for an idealized
pendulum problem (Figure 6.1.9). There is a tension $N$ in the string connecting the
mass $m$ to the pivot point (0, 0). The position of the mass is $(x, y)$, and the main
constraint is $x^2 + y^2 = L^2$. The tension $N$ is closely related to a Lagrange multiplier
$\lambda$ for the constraint that $x^2 + y^2 = L^2$.

The differential-algebraic equations to solve are

(6.1.64)    $$m \frac{d^2 x}{dt^2} = -N\, \frac{x}{\sqrt{x^2 + y^2}},$$

(6.1.65)    $$m \frac{d^2 y}{dt^2} = -N\, \frac{y}{\sqrt{x^2 + y^2}} - m g,$$

(6.1.66)    $$0 = L^2 - x^2 - y^2.$$

Fig. 6.1.9 Pendulum

We have differential equations for $x$ and $y$; these are the "differential" variables. The
other variable $N$ is defined implicitly through the "algebraic" equations. Unfortunately,
we cannot solve the algebraic equations for $N$ in terms of $x$ and $y$. However,
we can differentiate the algebraic equations (6.1.66) to get

    $$0 = \frac{d}{dt}\big(L^2 - x^2 - y^2\big) = -2x\, \frac{dx}{dt} - 2y\, \frac{dy}{dt}.$$

This still does not give us a way to determine $N$ in terms of $x$, $y$, $dx/dt$, $dy/dt$.
Differentiating one more time, using $u = dx/dt$ and $v = dy/dt$, we get

    $$0 = \frac{d}{dt}\big(-2xu - 2yv\big) = -2\frac{dx}{dt}\, u - 2\frac{dy}{dt}\, v - 2x\, \frac{du}{dt} - 2y\, \frac{dv}{dt}$$
    $$\quad = -2u^2 - 2v^2 + 2x\, \frac{N x}{m\sqrt{x^2 + y^2}} + 2y \Big( \frac{N y}{m\sqrt{x^2 + y^2}} + g \Big)$$
    $$\quad = -2(u^2 + v^2) + \frac{2N(x^2 + y^2)}{m\sqrt{x^2 + y^2}} + 2 g y.$$

This now gives a way to solve for $N$ explicitly in terms of $x$, $y$, $u$, $v$:

    $$N = \frac{m}{\sqrt{x^2 + y^2}}\, \big[ u^2 + v^2 - g y \big].$$

This formula for $N$ can be substituted into the other pendulum equations (6.1.64,
6.1.65) to give a complete system of differential equations. Alternatively, we can
obtain a differential equation for $N$ with one more differentiation with respect to
time.

This gives us multiple ways of formulating the same problem. Associated with 
each formulation is an index. This is the number of times the “algebraic” variables 
need to be differentiated in order to obtain differential equations for the algebraic 
variables. The original formulation (6.1.64–6.1.66) has index three, since three
differentiations with respect to $t$ are needed to get $dN/dt$ as a function of $x$, $y$, $u$, $v$, $N$.
The formulation with algebraic equation

$$0 = -2xu - 2yv$$

has index two, while the formulation with algebraic equation

$$0 = -2(u^2 + v^2) + \frac{2N(x^2 + y^2)}{m\sqrt{x^2 + y^2}} + 2gy$$

has index one. When we have a complete system of differential equations the system
has index zero: it is just a system of differential equations.

Suppose we use the following formulation


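$$\frac{dx}{dt} = u, \tag{6.1.67}$$

$$\frac{dy}{dt} = v, \tag{6.1.68}$$

$$m\,\frac{du}{dt} = -\frac{N x}{\sqrt{x^2 + y^2}}, \tag{6.1.69}$$

$$m\,\frac{dv}{dt} = -\frac{N y}{\sqrt{x^2 + y^2}} - m\,g, \tag{6.1.70}$$

$$N = \frac{m}{\sqrt{x^2 + y^2}}\left[u^2 + v^2 - g\,y\right] \tag{6.1.71}$$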


and substitute (6.1.71) into (6.1.69, 6.1.70). The differential equations together with
consistent initial conditions ($x_0^2 + y_0^2 = L^2$, $x_0 u_0 + y_0 v_0 = 0$) imply that
$x(t)^2 + y(t)^2 = L^2$ for all $t$. However, our numerical methods do not give exact solutions,
but rather approximate solutions. Since we obtained our formula for $N$ after taking
two derivatives with respect to $t$, our numerical solution can be modeled as a solution
to

$$\frac{d^2}{dt^2}\left(x^2 + y^2 - L^2\right) = \epsilon$$

where $\epsilon = O(h^p)$ is a bound on the LTE per unit time step; $p$ is the order of the
method. For fixed time intervals, the error in $x^2 + y^2 - L^2$ is $O(h^p)$ as expected. But
for long time intervals, the error grows to be $O(h^p (t - t_0)^2)$ provided $h^p (t - t_0)^2 \ll 1$.
If we integrate for longer than $|t - t_0| \sim h^{-p/2}$, the mechanical energy can change
sufficiently to cause an even more rapid growth in $|x^2 + y^2 - L^2|$. This drift can be
clearly seen in Figure 6.1.10, which shows $x(t)^2 + y(t)^2 - L^2$ against $t$ using
Heun's method with step size $h = 10^{-2}$. (This uses $m = 1$, $L = 1/2$, and $g = 9.82$.)
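The drift in the constraint can be reproduced with a few lines of code. The following is a minimal sketch, not the code used to produce Figure 6.1.10: it applies Heun's method to the formulation (6.1.67–6.1.71) with $m = 1$, $L = 1/2$, $g = 9.82$ as quoted in the text. The initial condition (released from rest at $(L, 0)$), the integration interval $[0, 2]$, and all function names are assumptions made for illustration.

```python
import numpy as np

M, L, G = 1.0, 0.5, 9.82   # mass, length, gravity (values quoted in the text)

def pendulum_rhs(z):
    """Right-hand side of (6.1.67)-(6.1.70) with N eliminated via (6.1.71)."""
    x, y, u, v = z
    r = np.hypot(x, y)
    N = M * (u * u + v * v - G * y) / r          # tension from (6.1.71)
    return np.array([u, v, -N * x / (M * r), -N * y / (M * r) - G])

def heun(f, z0, h, n_steps):
    """Heun's method (explicit trapezoidal rule); returns all iterates."""
    z = np.asarray(z0, dtype=float)
    out = [z.copy()]
    for _ in range(n_steps):
        k1 = f(z)
        k2 = f(z + h * k1)
        z = z + 0.5 * h * (k1 + k2)
        out.append(z.copy())
    return np.array(out)

def constraint_drift(traj):
    """x^2 + y^2 - L^2 along a computed trajectory."""
    return traj[:, 0] ** 2 + traj[:, 1] ** 2 - L ** 2

# Assumed initial condition: released from rest at (L, 0), which is consistent.
z0 = [L, 0.0, 0.0, 0.0]
drift_coarse = constraint_drift(heun(pendulum_rhs, z0, 1e-2, 200))   # h = 1e-2
drift_fine = constraint_drift(heun(pendulum_rhs, z0, 1e-3, 2000))    # h = 1e-3
print(np.abs(drift_coarse).max(), np.abs(drift_fine).max())
```

Nothing in this formulation pulls the computed solution back to the circle, so the printed drift is nonzero for both step sizes; reducing $h$ only reduces its size.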

By using the original DAE formulation (6.1.64–6.1.66) we can ensure that
$x^2 + y^2 - L^2 = 0$ is maintained. However, the order of the method used must
exceed the index of the DAE. Now we will look at some methods for solving DAEs.




Fig. 6.1.10 Drift in (6.1.67–6.1.71) using Heun's method with $h = 10^{-2}$

6.1.9.1 Runge-Kutta Methods for DAEs


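We need to use implicit Runge-Kutta methods that are stiffly accurate. But we need to
see how we can adapt Runge-Kutta methods to DAEs (6.1.60, 6.1.61). The standard
Runge-Kutta equations (6.1.20, 6.1.21)

$$v_{k,j} = f\Big(t_k + c_j h,\; x_k + h\sum_{i=1}^{s} a_{ji}\,v_{k,i}\Big), \qquad j = 1, 2, \ldots, s,$$

$$x_{k+1} = x_k + h\sum_{j=1}^{s} b_j\,v_{k,j},$$

apply to the equation $dx/dt = f(t, x)$. If we think of a DAE as the limit as $\epsilon \downarrow 0$ of
(6.1.62, 6.1.63),

$$\frac{dx_\epsilon}{dt} = f(t, x_\epsilon, y_\epsilon), \qquad x_\epsilon(t_0) = x_0,$$

$$\epsilon\,\frac{dy_\epsilon}{dt} = B\,g(t, x_\epsilon, y_\epsilon), \qquad y_\epsilon(t_0) = y_0,$$

then applying the Runge-Kutta method would give the approximate equations, where the
$w_{\epsilon,k,j}$ correspond to $dy_\epsilon/dt$ values,

$$v_{\epsilon,k,j} = f\Big(t_k + c_j h,\; x_{\epsilon,k} + h\sum_{i=1}^{s} a_{ji}\,v_{\epsilon,k,i},\; y_{\epsilon,k} + h\sum_{i=1}^{s} a_{ji}\,w_{\epsilon,k,i}\Big), \qquad j = 1, 2, \ldots, s,$$

$$\epsilon\,w_{\epsilon,k,j} = B\,g\Big(t_k + c_j h,\; x_{\epsilon,k} + h\sum_{i=1}^{s} a_{ji}\,v_{\epsilon,k,i},\; y_{\epsilon,k} + h\sum_{i=1}^{s} a_{ji}\,w_{\epsilon,k,i}\Big), \qquad j = 1, 2, \ldots, s.$$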
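Taking $\epsilon \downarrow 0$ and using invertibility of $B$ gives the DAE Runge-Kutta equations:

$$v_{k,j} = f\Big(t_k + c_j h,\; x_k + h\sum_{i=1}^{s} a_{ji}\,v_{k,i},\; y_k + h\sum_{i=1}^{s} a_{ji}\,w_{k,i}\Big), \qquad j = 1, 2, \ldots, s, \tag{6.1.72}$$

$$0 = g\Big(t_k + c_j h,\; x_k + h\sum_{i=1}^{s} a_{ji}\,v_{k,i},\; y_k + h\sum_{i=1}^{s} a_{ji}\,w_{k,i}\Big), \qquad j = 1, 2, \ldots, s. \tag{6.1.73}$$

The new $x$ and $y$ values are

$$x_{k+1} = x_k + h\sum_{j=1}^{s} b_j\,v_{k,j},$$

$$y_{k+1} = y_k + h\sum_{j=1}^{s} b_j\,w_{k,j}.$$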


For a stiffly accurate method, $b_j = a_{s,j}$ for all $j$ and $c_s = 1$, so $g(t_{k+1}, x_{k+1}, y_{k+1}) = 0$.
That is, we can guarantee that the condition $g(t_k, x_k, y_k) = 0$ holds for all $k = 1, 2, \ldots$,
at least to the accuracy with which the equations (6.1.72, 6.1.73) are solved. There
is certainly no drift occurring here.
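As an illustration of (6.1.72, 6.1.73), the sketch below uses the simplest stiffly accurate method, backward Euler (the one-stage Radau IIA method, with $a_{11} = b_1 = c_1 = 1$), on the index-one pendulum formulation. For one stage, the equations reduce to solving $z_{k+1} = z_k + h f(z_{k+1}, N_{k+1})$ together with $0 = g(z_{k+1}, N_{k+1})$. The Newton iteration with a finite-difference Jacobian, the tolerances, and the initial data are assumptions chosen for brevity; they are not taken from the text.

```python
import numpy as np

M, G = 1.0, 9.82   # mass and gravity, as quoted in the text

def f(z, N):
    """Differential part: z = (x, y, u, v), with tension N as algebraic variable."""
    x, y, u, v = z
    r = np.hypot(x, y)
    return np.array([u, v, -N * x / (M * r), -N * y / (M * r) - G])

def g(z, N):
    """Index-one algebraic equation: the twice-differentiated constraint."""
    x, y, u, v = z
    return -2.0 * (u * u + v * v) + 2.0 * N * np.hypot(x, y) / M + 2.0 * G * y

def residual(w, z_old, h):
    """Backward Euler equations for the unknown w = (z_new, N_new)."""
    z, N = w[:4], w[4]
    return np.concatenate([z - z_old - h * f(z, N), [g(z, N)]])

def newton(F, w, tol=1e-10, max_iter=25, eps=1e-7):
    """Plain Newton iteration with a forward-difference Jacobian (illustrative)."""
    for _ in range(max_iter):
        Fw = F(w)
        if np.linalg.norm(Fw) < tol:
            break
        J = np.empty((w.size, w.size))
        for j in range(w.size):
            dw = np.zeros_like(w)
            dw[j] = eps
            J[:, j] = (F(w + dw) - Fw) / eps
        w = w - np.linalg.solve(J, Fw)
    return w

def backward_euler_dae(z0, N0, h, n_steps):
    w = np.append(np.asarray(z0, dtype=float), N0)
    history = [w.copy()]
    for _ in range(n_steps):
        z_old = w[:4].copy()
        w = newton(lambda ww: residual(ww, z_old, h), w)
        history.append(w.copy())
    return np.array(history)

# Assumed consistent start: at rest at (1/2, 0), for which (6.1.71) gives N = 0.
hist = backward_euler_dae([0.5, 0.0, 0.0, 0.0], 0.0, 1e-2, 100)
print(max(abs(g(w[:4], w[4])) for w in hist[1:]))
```

The printed residual of the algebraic equation stays at the level of the Newton tolerance at every step. Note that this sketch enforces only the index-one equation; the original constraint $x^2 + y^2 = L^2$ is enforced only when the index-three formulation is solved.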

The Radau IIA methods are arguably the best Runge-Kutta methods for DAEs:
they have near to maximal order. Only Gauss methods have higher order for the same
number of stages, but Gauss methods are not stiffly accurate.

To see how this works in practice, Figure 6.1.11 shows the errors for the pendulum
problem formulated as an index 1, index 2, or index 3 DAE for the 3-stage
Radau IIA method. Note that different components of the solution have different
error behaviors for the higher index DAE formulations. The slopes as estimated, and
the orders predicted theoretically, are shown in Table 6.1.8. The theoretical basis for
these orders of accuracy was developed in [118, 135].


Table 6.1.8 Slopes and theoretically predicted orders of errors in positions, velocities and forces
for DAE formulations of different indexes using the 3-stage Radau IIA method

DAE index:        One              Two              Three
              Slope   Order    Slope   Order    Slope   Order
positions     5.09    5        4.89    5        4.58    5
velocities    4.98    5        4.99    5        3.07    3
forces        4.95    5        2.90    3        2.05    2

Fig. 6.1.11 Results for DAE formulations of the pendulum problem


6.1.9.2 BDF Methods for DAEs 


The first methods developed specifically for solving DAEs were the BDF methods,
with software developed including Hindmarsh's LSODI [126, pp. 312-316] and
Petzold's DASSL [202]. For the DAE (6.1.60, 6.1.61), the equations to solve for the
BDF method (6.1.39) are

$$x_{k+1} = \sum_{j=0}^{m-1} \alpha_j\,x_{k-j} + h\,\beta\,f(t_{k+1}, x_{k+1}, y_{k+1}), \tag{6.1.74}$$

$$0 = g(t_{k+1}, x_{k+1}, y_{k+1}). \tag{6.1.75}$$

BDF methods can solve DAEs of index one and two and have the same order of
accuracy $m$ as for solving ODEs provided the starting values $x_j$ have errors $O(h^{m+1})$
and $y_j$ have errors $O(h^m)$ for $j = 0, 1, \ldots, m-1$ [117, Sec. VI.2 & VII.3].
Exercises.


(1) Use Euler, Heun and the standard 4th order Runge-Kutta methods for the
differential equation $dy/dt = 1 + y^2$ with $y(0) = 0$ on the interval $[0, 1]$. The exact
solution is $y(t) = \tan t$. Do this for step-sizes $h = 2^{-k}$, $k = 1, 2, \ldots, 10$.
Compute and plot the maximum error $\max_{k:\,0 \le t_k \le 1} |y(t_k) - y_k|$ against $h$ on a log-log
plot. Empirically estimate the order of convergence for the three methods from
this data.

(2) Repeat Exercise 1 for the differential equation $dy/dt = y$, $y(0) = 1$ on the
interval $[0, 1]$.

(3) The Kepler problem is the problem of determining the motion of a particle of
mass $m$ around a fixed mass $M$ at the origin under the inverse square law of
Newtonian gravitation. That is, we want to solve the differential equation

$$m\,\frac{d^2x}{dt^2} = -G M m\,\frac{1}{\|x\|_2^2}\,\frac{x}{\|x\|_2}$$

with given initial values for position $x(0) = x_0$ and velocity $dx/dt(0) = v_0$.
Note that $G$ is the universal gravitational constant. Since the orbit remains in
the plane generated by the origin, $x_0$, and $v_0$, we can assume without loss of
generality that $x(t), v(t) \in \mathbb{R}^2$. Solve this differential equation for $GM = 1$,
$x_0 = [1, 0]^T$, and $v_0 = [0, 1]^T$; also solve for $GM = 1$, $x_0 = [1, 0]^T$, and
$v_0 = [0, 0.2]^T$. Note that this 2nd order equation in $\mathbb{R}^2$ should be first represented as
a first order equation in $\mathbb{R}^4$. Do this for step sizes $h = 10^{-k}$, $k = 1, 2, \ldots, 5$, for
an interval $t \in [0, 10]$. Plot $x(t)$ against $t$, and also the orbit $x(t)$ as a plot in $\mathbb{R}^2$.
Report on how small $h$ needs to be in order to obtain even moderately accurate
solutions.


(4) For Exercise 3, show that the energy $E(x(t), v(t)) := \tfrac{1}{2} m \|v(t)\|_2^2 - G M m / \|x(t)\|_2$
is constant along any exact trajectory. [Hint: Show $(d/dt)E(x(t), v(t)) = 0$
using the differential equation.] Compute the energy as a function of time for
the numerical solutions obtained in Exercise 3. Use this as a check on the error
of the solution.

(5) "Chaotic" systems have exponentially diverging solutions until the difference
becomes large. Consider, for example, the Lorenz equations [166]

$$\frac{dx}{dt} = \sigma(y - x),$$

$$\frac{dy}{dt} = x(\rho - z) - y,$$

$$\frac{dz}{dt} = xy - \beta z,$$

with $\sigma = 10$, $\beta = 8/3$, and $\rho = 28$.


(a) Numerically solve the Lorenz equations using the standard 4th order Runge— 
Kutta method with h = 10~? with initial conditions Xp = [Xo, yo, Zo] = 
[1, -1, 1] and x% = [xo, yo, zo] =[1, -1+ 10°, 1]. 

(b) Plot the difference between the two numerical solutions over the interval 
0 <t < 50. Since the difference changes so much in size, use a logarithmic 
scale on the vertical axis. 

(c) Repeat (a) and (b) with h = } x 107°. 


(6) ⚠ The Runge-Kutta-Fehlberg method (6.1.59) gives an estimate of the error
per step which can be used to adjust the step size h with each step. Imple- 
ment the adaptive ODE solver Algorithm 63 and apply it to the Kepler problem 
(Exercise 3). Compare the number of function evaluations needed to achieve a 
change of energy of no more than 10~°, especially for the case where GM = 1, 
$x_0 = [1, 0]^T$, and $v_0 = [0, 0.2]^T$.

(7) Implement a general purpose solver for diagonally implicit Runge-Kutta
methods. To make it general purpose, one of the inputs to the function is a
solver($t$, $\alpha$, $\beta$, $f$, $\nabla f$, $p$, $z_0$) function for solving $z = y + \alpha f(t, y + \beta z; p)$ for $z$. Here
$p$ is a vector or some other structure of parameters for $f$. Include default solver
functions for the fixed point iteration $z_{m+1} = y + \alpha f(t, y + \beta z_m; p)$, and for a
default guarded Newton solver. Note that the use of the solver function must be
flexible enough to allow for functions $f$ where the Jacobian matrix $\nabla f$ is not
available, or where $\nabla f$ returns a representation or approximation of the Jacobian
matrix as long as an appropriate solver function using the output of $\nabla f$ is
provided. Test this on the implicit trapezoidal method for the Kepler problem
(Exercise 3).

(8) Consider the n-body mass-spring system shown below.

The differential equations for the displacements $u_j(t)$ of particle $j$ are

$$m_j\,\frac{d^2u_j}{dt^2} = k_{j-1}(u_{j-1} - u_j) + k_j(u_{j+1} - u_j), \qquad j = 1, 2, \ldots, \ell,$$

where we take $u_0 = 0$ and $u_{\ell+1} = 0$. This approximates the wave equation
$M\,\partial^2 u/\partial t^2 = K\,\partial^2 u/\partial x^2$ for large $\ell$ if $m_j = M/\ell$ and $k_j = K\ell$ for all $j$. Solve
this using Heun's method and the implicit trapezoidal rule for $M = K = 1$ and
$\ell = 2^r$ with $r = 1, 2, \ldots, 10$. Use this to determine the maximum step size
needed for Heun's method for stability empirically for each $\ell$ used, and compare
this with the theoretically predicted stability limits. What happens with the
solution using the implicit trapezoidal rule with $1/\ell \ll h \ll 1$?

⚠ Read about the Courant-Friedrichs-Lewy (CFL) condition for numerically
solving the wave equation. Strictly speaking, the CFL condition only applies for 
a particular explicit method for solving the wave equation, but not for implicit 
methods. Discuss why the CFL condition might be useful to keep in mind even 
when using implicit methods. 
(9) Solve the three-dimensional double pendulum problem as a system of
differential-algebraic equations

$$m_1\,\frac{d^2x_1}{dt^2} = -m_1 g\,e_3 - T_1\,\frac{x_1}{\|x_1\|_2} + T_2\,\frac{x_2 - x_1}{\|x_2 - x_1\|_2},$$

$$m_2\,\frac{d^2x_2}{dt^2} = -m_2 g\,e_3 - T_2\,\frac{x_2 - x_1}{\|x_2 - x_1\|_2},$$

$$L_1^2 = \|x_1\|_2^2, \quad\text{and}\quad L_2^2 = \|x_2 - x_1\|_2^2.$$

In this system, $g$ is the gravitational acceleration, $m_j$ is the mass of particle $j$,
$L_j$ is the length of the rod connecting to particle $j$, and $T_j$ is the tension in the
rod of length $L_j$. Here the $m_j$'s and $L_j$'s are constants and the $T_j$'s are algebraic
variables. Note that the problem can be reformulated to use modified tensions
$\widehat{T}_1 = T_1/\|x_1\|_2$ and $\widehat{T}_2 = T_2/\|x_2 - x_1\|_2$. Solve this for the specific case with
$x_1 = [1, 1, 0]^T$ and $x_2 = [0, 1, -1]^T$ and zero initial velocities ($L_1$ and $L_2$ are
determined consistent with this data), and $g = m_j = 1$ for all $j$. For the method,
use the 3-stage Radau IIA method, solving the problem as an index 3 DAE.

(10) Show that the seventh order BDF method fails to be stable even if f(t, y) is 
identically zero. 


6.2 Ordinary Differential Equations—Boundary Value 
Problems 


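General boundary value problems (BVPs) have the form: given $f\colon \mathbb{R} \times \mathbb{R}^n \to \mathbb{R}^n$,
$g\colon \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^n$, and $a < b$, find the function $x\colon [a, b] \to \mathbb{R}^n$ satisfying

$$\frac{dx}{dt} = f(t, x), \qquad g(x(a), x(b)) = 0. \tag{6.2.1}$$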


Particular examples that commonly arise have the form

$$\frac{d^2y}{dt^2} = f\Big(t, y, \frac{dy}{dt}\Big), \qquad y(a) = y_a, \quad y(b) = y_b, \tag{6.2.2}$$

for given values $y_a$ and $y_b$. An example might come from looking for steady state
solutions of diffusion equations. Consider, for example, the concentration $c(t, x)$ of
oxygen in a one-dimensional medium in which it can diffuse, and is absorbed at a
rate proportional to its concentration:

$$\frac{\partial c}{\partial t} = D\,\frac{\partial^2 c}{\partial x^2} - b\,c.$$
If we assume that oxygen is supplied at $x = L$ and there is an impermeable boundary
at $x = 0$, then

$$c(t, L) = c_{\text{end}} \quad\text{and}\quad \frac{\partial c}{\partial x}(t, 0) = 0.$$

The steady state then satisfies the BVP

$$D\,\frac{d^2c}{dx^2} = b\,c, \qquad c(L) = c_{\text{end}}, \quad \frac{dc}{dx}(0) = 0. \tag{6.2.3}$$

Existence and uniqueness of solutions often depend on the specific problem.
General existence results usually require some topological assumptions. For example, if
$x^T f(t, x) < 0$ for all $t$ and $x$ where $\|x\| = R$, then $\|x(a)\| < R$ implies $\|x(b)\| < R$.
The map $x(a) \mapsto x(b)$ is continuous. Provided $\|h(z)\| < R$ whenever $\|z\| < R$, the
function $x(a) \mapsto h(x(b))$ has a fixed point by Brouwer's fixed point theorem.

Other cases should be considered from the point of view of optimization problems
in the calculus of variations:

$$\min_{y}\ \int_a^b F\Big(t, y(t), \frac{dy}{dt}(t)\Big)\,dt \tag{6.2.4}$$

$$\text{subject to}\quad y(a) = y_a, \quad y(b) = y_b. \tag{6.2.5}$$


This is essentially the case with the problem of oxygen concentration shown above.
The solution of (6.2.3) minimizes

$$\int_0^L \left( D\left(\frac{dc}{dx}\right)^2 + b\,c^2 \right) dx \quad\text{subject to}\quad c(L) = c_{\text{end}}.$$


Solutions exist for (6.2.4) with or without (6.2.5) provided $F(t, y, v)$ is continuous
in $(t, y, v)$, is convex in $v$, and coercive in $(y, v)$: that is, $F(t, y, v) \to \infty$ if
$\|y\| + \|v\| \to \infty$. Necessary conditions for a minimizer of (6.2.4, 6.2.5) are the
Euler-Lagrange equations provided $F(t, y, v)$ is smooth:

$$\frac{d}{dt}\,\nabla_v F\Big(t, y, \frac{dy}{dt}\Big) - \nabla_y F\Big(t, y, \frac{dy}{dt}\Big) = 0. \tag{6.2.6}$$

For numerical methods, we can go in two different directions. One is to treat 
the problem as a one-dimensional partial differential equation (PDE) and use those 
techniques. Another is to use the technology of initial value problems to solve the 
boundary value problem. 


Example 6.10 As a test problem, we will consider the problem of minimizing the 
area of a surface of revolution about the x-axis, as illustrated in Figure 6.2.1. It is the 
shape of a soap film between two co-axial rings. 


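The quantity to minimize is

$$\int_a^b \sqrt{1 + \left(\frac{dy}{dx}\right)^2}\; 2\pi\,y\,dx,$$

subject to the condition that $y(a) = y(b) = y_0$. The solution must satisfy the
Euler-Lagrange equations for this integrand, which, after dividing by $2\pi$, are

$$\frac{d}{dx}\left[\frac{y\,(dy/dx)}{\sqrt{1 + (dy/dx)^2}}\right] - \sqrt{1 + (dy/dx)^2} = 0,$$

or, after expanding the derivative,

$$\frac{(dy/dx)^2}{\sqrt{1 + (dy/dx)^2}} + \frac{y\,(d^2y/dx^2)}{\left(1 + (dy/dx)^2\right)^{3/2}} = \sqrt{1 + (dy/dx)^2},
\qquad\text{that is,}\qquad y\,\frac{d^2y}{dx^2} = 1 + \left(\frac{dy}{dx}\right)^2.$$

Writing this as a two-dimensional system with $v = dy/dx$, we have

$$\frac{dy}{dx} = v, \tag{6.2.7}$$

$$\frac{dv}{dx} = \frac{1}{y}\left[1 + v^2\right]. \tag{6.2.8}$$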

6.2.1 Shooting Methods 


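Shooting methods aim to use the solution map $S\colon x(a) \mapsto x(b)$ of the
differential equation $dx/dt = f(t, x)$ to help solve the problem. The equation
$0 = g(x(a), x(b))$ can be written as $0 = g(x(a), S(x(a)))$. For this we can use Newton's
method, for example. This requires computing an estimate of the Jacobian matrix

$$\nabla_{x(a)}\left[g(x(a), S(x(a)))\right] = \nabla_{x_1} g(x(a), S(x(a))) + \nabla_{x_2} g(x(a), S(x(a)))\,\nabla S(x(a)),$$

where $\nabla_{x_1} g$ and $\nabla_{x_2} g$ denote the derivatives of $g$ with respect to its first and second arguments.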


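We need to determine $\nabla S(x(a))$. Let $x(t; x_0)$ be the solution of the differential
equation $dx/dt = f(t, x)$ with $x(a) = x_0$. If $\Phi(t) = \nabla_{x_0} x(t; x_0)$ then

$$\begin{aligned}
\frac{d}{dt}\Phi(t) &= \frac{d}{dt}\nabla_{x_0} x(t; x_0) = \nabla_{x_0}\frac{d}{dt} x(t; x_0) \\
 &= \nabla_{x_0}\left[f(t, x(t; x_0))\right] \\
 &= \nabla_x f(t, x(t; x_0))\,\nabla_{x_0} x(t; x_0) \\
 &= \nabla_x f(t, x(t; x_0))\,\Phi(t).
\end{aligned} \tag{6.2.9}$$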


Fig. 6.2.1 Solution to soap film boundary value problem


This is the variational equation for the differential equation $dx/dt = f(t, x)$. For
initial conditions for $\Phi$, we note that $x(a; x_0) = x_0$. So $\Phi(a) = \nabla_{x_0} x(a; x_0) =
\nabla_{x_0} x_0 = I$. Then $x(t)$ and $\Phi(t)$ can be computed together using a standard numerical
ODE solver applied to

$$\frac{d}{dt}\begin{bmatrix} x \\ \Phi \end{bmatrix} = \begin{bmatrix} f(t, x) \\ \nabla_x f(t, x)\,\Phi \end{bmatrix},
\qquad \begin{bmatrix} x(a) \\ \Phi(a) \end{bmatrix} = \begin{bmatrix} x_0 \\ I \end{bmatrix}. \tag{6.2.10}$$

This does require computing the Jacobian matrix of $f(t, x)$ with respect to $x$. This
can be done using symbolic computation, numerical differentiation formulas (see
Section 5.5.1.1), or automatic differentiation (see Section 5.5.2).


Once $\Phi(t)$ has been computed, $\nabla S(x_0) = \Phi(b)$, and we can apply Newton's
method.


Example 6.11 For a concrete example, we solve (6.2.7, 6.2.8) with the boundary
conditions: $y(a) = y(b) = y_0$ with $a = -1$, $b = +1$, $y_0 = 1.6$. We use the guarded
Newton method to compute $v(a) = v_0$ so that $y(b) = y_0$; the variational equations
(6.2.10) are used to compute the derivative $\partial y(b)/\partial v_0$ needed for the Newton
method. Starting from the estimate $v_0 = 0$, the computed value of $v_0 \approx -1.0096$
for which $|y(b) - y_0| < 10^{-7}$ is obtained in five Newton steps. No backtracking is
necessary for this example. The resulting function $y(x)$ is then computed using the
values $(y_0, v_0)$. The standard fourth order Runge-Kutta method with step size $10^{-2}$
is used throughout for solving the original and variational differential equations. The
resulting trajectory is shown in Figure 6.2.1.


There are actually two solutions for this value of $y_0$, and no solutions if $y_0 < y_0^* \approx
1.508879$.
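The shooting computation of Example 6.11 can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code behind the quoted results: it uses a plain (unguarded) Newton iteration, which the example says suffices here, and integrates only the column of the variational equation (6.2.10) that corresponds to $v_0$. The function names are hypothetical.

```python
import numpy as np

def rhs(w):
    """(6.2.7, 6.2.8) together with the variational equations for d/dv0.

    w = (y, v, dy, dv), where dy = dy/dv0 and dv = dv/dv0.
    """
    y, v, dy, dv = w
    return np.array([
        v,
        (1.0 + v * v) / y,
        dv,
        2.0 * v * dv / y - (1.0 + v * v) * dy / y**2,
    ])

def rk4(w, a, b, n):
    """Standard fourth order Runge-Kutta method with n equal steps on [a, b]."""
    h = (b - a) / n
    for _ in range(n):
        k1 = rhs(w)
        k2 = rhs(w + 0.5 * h * k1)
        k3 = rhs(w + 0.5 * h * k2)
        k4 = rhs(w + h * k3)
        w = w + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return w

def shoot(y0=1.6, a=-1.0, b=1.0, v0=0.0, n=200, tol=1e-7, max_iter=20):
    """Newton iteration on v0 so that y(b) = y0; n = 200 gives step size 1e-2."""
    for it in range(max_iter):
        # Initial values: y(a) = y0, v(a) = v0, dy/dv0 = 0, dv/dv0 = 1.
        yb, _, dyb, _ = rk4(np.array([y0, v0, 0.0, 1.0]), a, b, n)
        if abs(yb - y0) < tol:
            return v0, it
        v0 -= (yb - y0) / dyb
    raise RuntimeError("Newton iteration did not converge")

v0, newton_steps = shoot()
print(v0, newton_steps)
```

The computed slope can be compared with the value $v_0 \approx -1.0096$ quoted in the example.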


Example 6.12 Shooting methods can obtain highly accurate solutions to boundary 
value problems of this type. However, they can also become numerically unstable if 
the differential equation is unstable as an initial value problem. Consider, for example,

$$\frac{d^2y}{dx^2} = 100\,y, \qquad y(0) = y(10) = 1. \tag{6.2.11}$$

This is an example of (6.2.3) where $b/D = 100$. Using a shooting method means
solving

$$\frac{d^2y}{dx^2} = 100\,y, \qquad y(0) = 1, \quad \frac{dy}{dx}(0) = v_0,$$

where we wish to find $v_0$ by solving for $y(10) = 1$. The exact solution of the above
equation is

$$y(x) = \frac{1 + v_0/10}{2}\,e^{10x} + \frac{1 - v_0/10}{2}\,e^{-10x}.$$

Putting $x = 10$ we see that

$$y(10) = \frac{1 + v_0/10}{2}\,e^{100} + \frac{1 - v_0/10}{2}\,e^{-100} = \cosh(100) + \frac{v_0}{10}\sinh(100),$$

so the derivative is $\partial y(10)/\partial v_0 = \sinh(100)/10 \approx 1.34 \times 10^{42}$. Any perturbation
in $v_0$ is amplified by this amount. Even if the error in $v_0$ is of the size of unit roundoff for
double precision ($u \approx 2.2 \times 10^{-16}$), the computed solution will probably be in error
by about $1.34 \times 10^{42} \times 2.2 \times 10^{-16} \approx 3 \times 10^{26}$. Yet the boundary value problem is
actually very well conditioned as a boundary value problem. The exact solution is

$$y(x) = \frac{\left(y(0) - e^{-100}\,y(10)\right)e^{-10x} + \left(y(10) - e^{-100}\,y(0)\right)e^{-10(10-x)}}{1 - e^{-200}}
\approx y(0)\,e^{-10x} + y(10)\,e^{-10(10-x)},$$

with the approximation being within about $e^{-100}\max(|y(0)|, |y(10)|)$ of the exact
solution.
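The arithmetic in this example is easy to check directly. The short sketch below evaluates the amplification factor $\sinh(100)/10$ and the resulting error estimate; taking $2^{-52}$ as the value of $u$ is an assumption that matches the figure $2.2 \times 10^{-16}$ used above.

```python
import math

amplification = math.sinh(100.0) / 10.0   # d y(10) / d v_0 for the shooting problem
unit_roundoff = 2.0 ** -52                # about 2.2e-16 in IEEE double precision
error_estimate = amplification * unit_roundoff

print(f"{amplification:.3e}  {error_estimate:.1e}")
```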


6.2.2 Multiple Shooting 


The most important issue with shooting methods is that the condition number of
the Jacobian matrix $\Phi(t)$ in the variational equation (6.2.10) typically grows
exponentially as $t \to \infty$. This can result in extremely ill-conditioned equations to solve
for the starting point. We can avoid this extreme ill-conditioning by sub-dividing
the interval $[a, b]$ into smaller pieces $a = t_0 < t_1 < \cdots < t_m = b$. Then to solve
$g(x(a), x(b)) = 0$ where $dx/dt = f(t, x)$ we now have additional equations to
satisfy so that $x(t_j^-) = x(t_j^+)$ at every interior break point $t_j$, $j = 1, 2, \ldots, m-1$.
We have the functions $S_j(x_j) = z(t_{j+1})$ where $dz/dt = f(t, z(t))$ and $z(t_j) = x_j$.
As for the standard shooting algorithm, $\nabla S_j(x_j)$ can be computed by means of the
variational equation (6.2.10) except that $\nabla S_j(x_j) = \Phi_j(t_{j+1})$ where $\Phi_j(t_j) = I$.
Provided we make $L\,|t_{j+1} - t_j|$ modest, the condition number of each $\Phi_j(t_{j+1})$
should also be modest, and the overall system should not be ill-conditioned. The
overall linear system to be solved for each step of Newton's method for solving
$g(x(a), x(b)) = 0$ is


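$$\begin{bmatrix}
-\nabla S_0(x_0) & I & & & \\
 & -\nabla S_1(x_1) & I & & \\
 & & \ddots & \ddots & \\
 & & & -\nabla S_{m-1}(x_{m-1}) & I \\
\nabla_{x_0} g(x_0, x_m) & & & & \nabla_{x_m} g(x_0, x_m)
\end{bmatrix}
\begin{bmatrix} \delta x_0 \\ \delta x_1 \\ \vdots \\ \delta x_{m-1} \\ \delta x_m \end{bmatrix}
=
\begin{bmatrix} S_0(x_0) - x_1 \\ S_1(x_1) - x_2 \\ \vdots \\ S_{m-1}(x_{m-1}) - x_m \\ -g(x_0, x_m) \end{bmatrix}. \tag{6.2.12}$$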

Ill-conditioning can still occur, but then it will be inherent in the problem, not an
artifact of the shooting method. Also, the multiple shooting matrix is relatively sparse,
so block-sparse matrix techniques can be used. For example, we can apply a block
LU factorization to the matrix in (6.2.12), utilizing the block sparsity of the matrix.
If $x(t) \in \mathbb{R}^n$ then (6.2.12) can be solved in $O(m\,n^3)$ operations. Of course, LU
factorization without pivoting can be numerically unstable. On the other hand, a
block QR factorization can be performed in $O(m\,n^3)$ operations without the risk of
numerical instability, and the block sparsity of the matrix is still preserved.
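To see multiple shooting at work on the problem (6.2.11) of Example 6.12, the sketch below assembles a system with the block structure of (6.2.12) and solves it. Several simplifications are assumptions made for illustration: because the problem is linear, the maps $S_j$ are given exactly by a transition matrix built from cosh and sinh, so one Newton step from the zero initial guess is exact; and a dense solver is used rather than a block-sparse factorization. The function names and the choice $m = 50$ are hypothetical.

```python
import numpy as np

def multiple_shooting_linear(m=50, a=0.0, b=10.0, lam=10.0, ya=1.0, yb=1.0):
    """Multiple shooting for y'' = lam^2 y, y(a) = ya, y(b) = yb, state x = (y, y')."""
    dt = (b - a) / m
    c, s = np.cosh(lam * dt), np.sinh(lam * dt)
    Phi = np.array([[c, s / lam], [lam * s, c]])   # exact S_j and its Jacobian
    n = 2
    J = np.zeros((n * (m + 1), n * (m + 1)))
    F = np.zeros(n * (m + 1))
    X = np.zeros((m + 1, n))                       # initial guess x_j = 0
    for j in range(m):
        rows = slice(n * j, n * (j + 1))
        J[rows, n * j:n * (j + 1)] = -Phi          # -grad S_j(x_j)
        J[rows, n * (j + 1):n * (j + 2)] = np.eye(n)
        F[rows] = X[j + 1] - Phi @ X[j]            # continuity residual
    # Last block row: boundary conditions g(x_0, x_m) = (y_0 - ya, y_m - yb).
    J[n * m, 0] = 1.0
    J[n * m + 1, n * m] = 1.0
    F[n * m] = X[0, 0] - ya
    F[n * m + 1] = X[m, 0] - yb
    X = X + np.linalg.solve(J, -F).reshape(m + 1, n)   # one Newton step
    return np.linspace(a, b, m + 1), X, J

def exact_solution(x, ya=1.0, yb=1.0):
    """Exact solution of (6.2.11) in the well-conditioned form given in Example 6.12."""
    e100 = np.exp(-100.0)
    num = (ya - e100 * yb) * np.exp(-10.0 * x) + (yb - e100 * ya) * np.exp(-10.0 * (10.0 - x))
    return num / (1.0 - np.exp(-200.0))

t, X, J = multiple_shooting_linear()
print(np.max(np.abs(X[:, 0] - exact_solution(t))), np.linalg.cond(J))
```

With subintervals of length $0.2$, each transition matrix has entries of modest size, and the printed condition number of the multiple shooting matrix is many orders of magnitude below the single-shooting amplification factor $\sinh(100)/10$.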


6.2.3 Finite Difference Approximations 


Another approach to boundary value problems for ordinary differential equations is
to directly approximate the derivatives. This is particularly useful for second-order
differential equations, such as the equations for diffusion (6.2.3). The basic idea is to
approximate $d^2y/dx^2$ with the finite difference approximation
$(y(x+h) - 2y(x) + y(x-h))/h^2 = d^2y/dx^2(x) + O(h^2)$. If we use equally spaced points
$x_j = a + jh$, $nh = L$, then (6.2.3) becomes

$$D\,\frac{c_{j+1} - 2c_j + c_{j-1}}{h^2} = b\,c_j, \qquad j = 1, 2, \ldots, n-1. \tag{6.2.13}$$

We need boundary conditions for the two end points, and these are $c(L) = c_{\text{end}}$
and $dc/dx(0) = 0$. Discretizing these in the obvious way, we get $c_n = c_{\text{end}}$ and
$(c_1 - c_0)/h = 0$. Note that the second equation uses the one-sided difference
approximation. Using the centered difference approximation is not useful here as that would
require $c_{-1}$, which is not available. This gives a linear system


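$$\begin{bmatrix}
-1/h & 1/h & & & \\
D/h^2 & -2D/h^2 - b & D/h^2 & & \\
 & D/h^2 & -2D/h^2 - b & \ddots & \\
 & & \ddots & \ddots & D/h^2 \\
 & & & D/h^2 & -2D/h^2 - b
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ \vdots \\ c_{n-1} \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ -D\,c_{\text{end}}/h^2 \end{bmatrix}.$$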


Multiplying the first row by $D/h$ gives a symmetric matrix $A_h$. Provided $D, b > 0$,
$-A_h$ is also positive definite. To see that $-A_h$ is positive definite, note that

$$-c^T A_h\,c = (D/h^2)\left[\sum_{j=0}^{n-2} (c_{j+1} - c_j)^2 + c_{n-1}^2\right] + b\sum_{j=1}^{n-1} c_j^2.$$

The condition number of $A_h$ is $O(h^{-2})$. There is also the bound $\|A_h^{-1}\|_2 \le 4L^2/(\pi^2 D)$
for all $h$. Solving this linear system can be done using standard sparse matrix
techniques.
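The discretization (6.2.13) with the boundary conditions above can be assembled and solved in a few lines. The sketch below is illustrative only: the parameter values $D = b = L = c_{\text{end}} = 1$ are assumptions, a dense solver stands in for the sparse techniques mentioned in the text, and the comparison uses the exact steady state $c(x) = c_{\text{end}}\cosh(kx)/\cosh(kL)$ with $k = \sqrt{b/D}$, which satisfies (6.2.3).

```python
import numpy as np

def solve_steady_diffusion(n, D=1.0, b=1.0, L=1.0, c_end=1.0):
    """Finite difference solution of D c'' = b c, c'(0) = 0, c(L) = c_end.

    Unknowns are c_0, ..., c_{n-1}; the first row is the one-sided boundary
    equation multiplied by D/h so that the matrix A_h is symmetric.
    """
    h = L / n
    A = np.zeros((n, n))
    rhs = np.zeros(n)
    A[0, 0], A[0, 1] = -D / h**2, D / h**2
    for j in range(1, n):
        A[j, j - 1] = D / h**2
        A[j, j] = -2.0 * D / h**2 - b
        if j + 1 < n:
            A[j, j + 1] = D / h**2
    rhs[n - 1] = -D * c_end / h**2          # known value c_n = c_end moved to the right
    return h * np.arange(n), np.linalg.solve(A, rhs), A

def exact(x, D=1.0, b=1.0, L=1.0, c_end=1.0):
    k = np.sqrt(b / D)
    return c_end * np.cosh(k * x) / np.cosh(k * L)

x100, c100, A100 = solve_steady_diffusion(100)
x200, c200, _ = solve_steady_diffusion(200)
err100 = np.max(np.abs(c100 - exact(x100)))
err200 = np.max(np.abs(c200 - exact(x200)))
print(err100, err200)
```

Because the boundary condition at $x = 0$ uses a one-sided difference, the error is expected to decrease only at first order in $h$, even though the interior formula is second-order accurate.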


Exercises. 


(1) Implement the variational equation solver for (6.2.7, 6.2.8). That is, minimize
the discrete approximation

$$\sum_{k=0}^{n-1} \sqrt{1 + \left(\frac{y_{k+1} - y_k}{h}\right)^2}\; 2\pi\,\frac{y_{k+1} + y_k}{2}\,h$$

of the integral $\int_a^b \sqrt{1 + (dy/dx)^2}\; 2\pi\,y\,dx$ subject to the boundary conditions
that $y_0 = y_n$ is fixed. As usual, $h = (b - a)/n$. Use any available optimization
methods and software. Do this for $n = 2^k$, $k = 1, 2, \ldots, 10$, and $y_0 = 1.6$. Plot
the maximum error in the solution against $h$; the exact solution has the form
$y(x) = (1/\beta)\cosh(\beta x)$ where $(1/\beta)\cosh(\beta) = y_0$.
(2) Consider the problem of finding periodic orbits $x(T; a) = a$ of an autonomous
differential equation: $dx/dt = f(x)$ and $x(0; a) = a$. Show that if $x(T_0; a_0) =
a_0$ then $\nabla_a x(T_0; a_0)\,f(a_0) = f(a_0)$. From this, show that $\nabla_a\left[x(T_0; a) - a\right]$ is
not invertible at $a = a_0$.

(3) Continuing the previous Exercise, note that the period $T$ is really an additional
variable, and so we should add an additional equation. If $n^T f(a_0) \ne 0$, we can
solve the equations

$$0 = x(T; a) - a, \qquad c = n^T a$$

for computing both $a$ and $T$, using a convenient value of $c$. Using the variational
equation (6.2.10), implement a Newton method for finding $(a, T)$ generating
periodic orbits. Apply this to finding periodic orbits for the van der Pol equation

$$\frac{d^2v}{dt^2} - \mu(1 - v^2)\frac{dv}{dt} + v = 0 \tag{6.2.14}$$

with $\mu = 2$. Since the periodic orbits of the van der Pol equations are limit cycles,
solving the van der Pol equation forward in time can give good starting points
for the Newton method.

(4) The Blasius problem is related to the asymptotics of viscous fluid flow over a
plate, and is the third-order differential equation $y''' + \tfrac{1}{2}\,y\,y'' = 0$ with boundary
conditions $y(0) = y'(0) = 0$ and $y'(\infty) = 1$. The boundary condition "at infinity,"
$y'(\infty) = 1$, can be approximated by $y'(L) = 1$ with $L$ large. Use a shooting
method to solve the Blasius problem (the value of $y''(0)$ is the quantity to solve
for). Choose different values of $L$ and discuss the convergence of the solution
as $L$ becomes large.


(5) ⚠ In this Exercise, we look at a method for finding geodesics on a surface; that
is, curves of minimal length between two points on the surface. Suppose that a
surface is given by a scalar equation $g(x) = 0$. This can be represented by the
constrained optimization problem:

$$\begin{aligned}
\text{minimize}\quad & \int_0^1 \frac{1}{2}\left\|\frac{dx}{dt}(t)\right\|_2^2 dt \\
\text{subject to}\quad & g(x(t)) = 0 \quad\text{for all } t, \\
 & x(0) = x_0, \quad x(1) = x_1.
\end{aligned} \tag{6.2.15}$$

Using a Lagrange multiplier approach with Lagrange multiplier $\lambda(t)$ for the
constraint $g(x(t)) = 0$, we obtain
$$\frac{d^2x}{dt^2} - \lambda(t)\,\nabla g(x(t)) = 0.$$
By differentiating $g(x(t)) = 0$ twice, show that

$$0 = \nabla g(x(t))^T\,\frac{d^2x}{dt^2}(t) + \frac{dx}{dt}(t)^T\,\mathrm{Hess}\,g(x(t))\,\frac{dx}{dt}(t).$$

Use this to obtain an equation for $\lambda(t)$ in terms of $\nabla g(x(t))$, $\mathrm{Hess}\,g(x(t))$, and
$dx/dt(t)$. Develop a shooting method for finding $v_0 = v(0) = dx/dt(0)$ to hit
the target $x(1) = x_1$. The differential equation solver can enforce the constraints
$g(x(t)) = 0$ and $\nabla g(x(t))^T (dx/dt)(t) = 0$ at the end of each step to prevent
"drift" away from the surface. Apply this to the problem of finding geodesics
on an ellipsoid $(x/a)^2 + (y/b)^2 + (z/c)^2 = 1$. The special case of $a = b = c$ is a
sphere, and geodesics on a sphere are "great circles" (circles on the sphere that
are centered at the center of the sphere).


6.3 Partial Differential Equations—Elliptic Problems 


Partial differential equations come in a number of different essential types, which 
are best exemplified in two spatial dimensions below: 


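$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x, y) \qquad\text{(Poisson equation)} \tag{6.3.1}$$

$$\frac{\partial u}{\partial t} = D\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) + f(t, x, y) \qquad\text{(Diffusion equation)} \tag{6.3.2}$$

$$\frac{\partial^2 u}{\partial t^2} = c^2\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) + f(t, x, y) \qquad\text{(Wave equation)} \tag{6.3.3}$$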


The Poisson equation is an example of an elliptic partial differential equation; the 
diffusion (or heat) equation is an example of a parabolic partial differential equation; 
while the wave equation is an example of a hyperbolic partial differential equation. 

To understand the difference between these different types, consider $u(x, y) =
\exp(i(k_x x + k_y y))$ for the Poisson equation, and $u(t, x, y) = \exp(i(k_t t + k_x x +
k_y y))$ for the diffusion and wave equations. The corresponding $f(x, y)$ and $f(t, x, y)$
that give these solutions are

$$f(x, y) = -(k_x^2 + k_y^2)\exp(i(k_x x + k_y y)) \quad\text{for the Poisson equation,}$$

$$f(t, x, y) = \left(i k_t + D(k_x^2 + k_y^2)\right)\exp(i(k_t t + k_x x + k_y y)) \quad\text{for the diffusion equation, and}$$

$$f(t, x, y) = \left(-k_t^2 + c^2(k_x^2 + k_y^2)\right)\exp(i(k_t t + k_x x + k_y y)) \quad\text{for the wave equation.}$$

The wave equation is different from the others, as if $k_t^2 = c^2(k_x^2 + k_y^2)$ then $f(t, x, y) =
0$. This means that information can travel in the direction $k = (k_t, k_x, k_y)$ in the
solution $u(t, x, y)$ even with $f(t, x, y) = 0$ for all $(t, x, y)$.
For the diffusion equation, we get $k_t$ imaginary for $k_x$ and $k_y$ real: $k_t = i D(k_x^2 +
k_y^2)$ so $\exp(i(k_t t + k_x x + k_y y)) = \exp(-D(k_x^2 + k_y^2)\,t + i(k_x x + k_y y))$, which decays
exponentially as $t$ increases. This means that high frequency components of $u(t, x, y)$
decay rapidly as $t$ increases. Flipping the sign of $\partial u/\partial t$ changes rapid exponential
decay to rapid exponential growth, which is very undesirable. So the sign of $\partial u/\partial t$
is very important for diffusion equations.

For the Poisson equation, if $k = (k_x, k_y) \ne 0$, a component of the solution
$u(x, y)$ of the form $\exp(i(k_x x + k_y y))$ must be reflected in $f(x, y)$. Furthermore,
the coefficient of $\exp(i(k_x x + k_y y))$ is $-1/(k_x^2 + k_y^2)$ times the coefficient
of $\exp(i(k_x x + k_y y))$ in $f(x, y)$. So the coefficient of a high frequency component
($(k_x, k_y)$ large) in the solution is much less than the corresponding coefficient of
$f(x, y)$. Thus the solution $u(x, y)$ is generally much smoother than $f(x, y)$.

This classification can be made more sophisticated and more precise through 
Fourier transforms. More details can be found in [213, 245], for example. 

In this section we will focus on equations like the Poisson equation, called elliptic
equations. Elliptic equations are linear partial differential equations $A\,u(x) =
f(x)$ where $u(x) = \exp(i k^T x)$ gives $A\,u(x) = s(x, k)\,u(x)$ and for each
$x$, $|s(x, k)| \to \infty$ as $\|k\| \to \infty$. The function $s(x, k)$ is called the symbol of the
partial differential equation, and contains valuable information about the equation
beyond just its classification.

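It can be very helpful to use the operators and theorems of vector calculus to work
with partial differential equations. In particular we use the divergence operator of a
vector function:

$$\operatorname{div} v(x) = \sum_{j=1}^{n} \frac{\partial v_j}{\partial x_j}(x), \tag{6.3.4}$$

the gradient operator of a scalar function

$$\nabla u(x) = \left[\frac{\partial u}{\partial x_1}(x),\ \frac{\partial u}{\partial x_2}(x),\ \ldots,\ \frac{\partial u}{\partial x_n}(x)\right]^T,$$

the Jacobian operator of a vector function,

$$\nabla v(x) = \begin{bmatrix}
\dfrac{\partial v_1}{\partial x_1}(x) & \dfrac{\partial v_1}{\partial x_2}(x) & \cdots & \dfrac{\partial v_1}{\partial x_n}(x) \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial v_m}{\partial x_1}(x) & \dfrac{\partial v_m}{\partial x_2}(x) & \cdots & \dfrac{\partial v_m}{\partial x_n}(x)
\end{bmatrix},$$

and the curl or rotation operator for functions $v(x) \in \mathbb{R}^3$ with $x \in \mathbb{R}^3$,

$$\nabla \times v(x) = \left[\frac{\partial v_3}{\partial x_2}(x) - \frac{\partial v_2}{\partial x_3}(x),\ \
\frac{\partial v_1}{\partial x_3}(x) - \frac{\partial v_3}{\partial x_1}(x),\ \
\frac{\partial v_2}{\partial x_1}(x) - \frac{\partial v_1}{\partial x_2}(x)\right]^T.$$
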
The most important theorem of vector calculus is the divergence theorem: for any
region $\Omega \subset \mathbb{R}^n$ that is bounded and has a piecewise smooth boundary, and $v$
continuously differentiable,

$$\int_{\Omega} \operatorname{div} v(x)\,dx = \int_{\partial\Omega} v(x) \cdot n(x)\,dS(x) \tag{6.3.5}$$

where $\partial\Omega$ is the boundary of $\Omega$, $n(x)$ is the outward pointing normal vector
(perpendicular to the boundary at $x$), and $dS(x)$ indicates integration over the surface
area of $\partial\Omega$.

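The equations given above can be easily represented in terms of these operators:

$$\operatorname{div} \nabla u = f(x), \qquad\text{Poisson equation,}$$

$$\frac{\partial u}{\partial t} = D \operatorname{div} \nabla u(x) + f(t, x), \qquad\text{diffusion equation,}$$

$$\frac{\partial^2 u}{\partial t^2} = c^2 \operatorname{div} \nabla u(x) + f(t, x), \qquad\text{wave equation.}$$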

An important family of differential equations are in divergence form: 
(6.3.6) div F(x, u(x), Vu(x)) = f(x). 


Equations in this form can be thought of as representing a physical situation where 
F(x, u(x), Vu(x)) is a flux, or flow, vector of a conserved quantity }(x). This insight 
can be useful for designing numerical methods, especially where the conserved quan- 
tity should be conserved by the numerical method. 


6.3.1 Finite Difference Approximations 


We start with the Poisson equation in two dimensions (6.3.1) in a region $\Omega$:

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x, y) \qquad\text{for } (x, y) \in \Omega.$$

For simplicity we suppose that we have Dirichlet boundary conditions: $u(x, y) =
g(x, y)$ for any $(x, y) \in \partial\Omega$, the boundary of $\Omega$, where $g(x, y)$ is a given function.
Supposing that $\Omega \subset [a, b] \times [c, d]$ we set $x_i = a + ih$ and $y_j = c + jh$ where $h >
0$ is a common fixed spacing. We can use the approximation of the second derivative

$$\frac{\partial^2 u}{\partial x^2}(x, y) = \frac{u(x + h, y) - 2u(x, y) + u(x - h, y)}{h^2} + O(h^2)$$

to give an approximation to the left-hand side:

$$\frac{u(x + h, y) + u(x, y + h) - 4u(x, y) + u(x - h, y) + u(x, y - h)}{h^2} = f(x, y) + O(h^2).$$

Setting $x = x_i$ and $y = y_j$ gives

$$\frac{u(x_{i+1}, y_j) + u(x_i, y_{j+1}) - 4u(x_i, y_j) + u(x_{i-1}, y_j) + u(x_i, y_{j-1})}{h^2} = f(x_i, y_j) + O(h^2).$$
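Using the computed quantities $u_{i,j} \approx u(x_i, y_j)$ we have

$$\frac{u_{i+1,j} + u_{i,j+1} - 4u_{i,j} + u_{i-1,j} + u_{i,j-1}}{h^2} = f(x_i, y_j) \qquad\text{for } (x_i, y_j) \in \Omega. \tag{6.3.7}$$

If $(x_i, y_j) \notin \Omega$ we set $u_{i,j} = g(x_i, y_j)$.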

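If we set $\Omega_h = \{\,(i, j) \mid (x_i, y_j) \in \Omega\,\}$ to be the discrete domain, then we can
consider our unknowns $u_{i,j}$ for $(i, j) \in \Omega_h$ as a vector $u_h \in \mathbb{R}^{\Omega_h}$ with indexes $(i, j) \in
\Omega_h$. Then the system of equations (6.3.7) is represented in matrix-vector form as

$$A_h u_h = f_h - k_h \qquad\text{where } (f_h)_{(i,j)} = f(x_i, y_j), \text{ and}$$

$$(k_h)_{(i,j)} = \frac{\tilde g_{i+1,j} + \tilde g_{i,j+1} - 4\tilde g_{i,j} + \tilde g_{i-1,j} + \tilde g_{i,j-1}}{h^2}, \qquad\text{where}$$

$$\tilde g_{i,j} = \begin{cases} g(x_i, y_j), & \text{if } (i, j) \notin \Omega_h, \\ 0, & \text{if } (i, j) \in \Omega_h. \end{cases}$$

This allows us to incorporate the boundary conditions into the linear system.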
The matrix A, is given by the formula below for (i, 7), (k, £2) € Qh, 


—4, if Gj) =(k, 2), 
Anip«é¢o=ytl ifGA=kltDorg)y=k+1,8, 
0, otherwise. 


It is easy to check that $A_h$ is symmetric (that is, $(A_h)_{(i,j),(k,\ell)} = (A_h)_{(k,\ell),(i,j)}$ for all $(i,j),(k,\ell)\in\Omega_h$). Also, $-A_h$ is positive definite as

$-z_h^T A_h z_h = \frac{1}{2h^2}\sum_{(i,j)\in\Omega_h}\left[(z_{i,j} - z_{i+1,j})^2 + (z_{i,j} - z_{i-1,j})^2 + (z_{i,j} - z_{i,j+1})^2 + (z_{i,j} - z_{i,j-1})^2\right]$,

taking the value of $z_{i,j} = 0$ for $(i,j)\notin\Omega_h$. The value of $-z_h^T A_h z_h \ge 0$ as it is a sum of squares. Also $z_h^T A_h z_h = 0$ implies that for all $(i,j)\in\Omega_h$ we have $z_{i,j} = z_{i\pm1,j} = z_{i,j\pm1}$. Because $\Omega_h$ is finite, we must therefore have $z_{i,j} = 0$ for all $(i,j)$ in this case. Thus $-A_h$ is symmetric positive definite.

Results for the region $\Omega = \{\,(x,y) \mid (x^2 + y^2 < 1 \text{ and } (x > 0 \text{ or } y > 0)) \text{ or } 0 < x, y < 1\,\}$ with $u(x,y) = \exp(x-y)$ on $\partial\Omega$, $h = 1/50$, and $f(x,y) = 1$ are shown as a three-dimensional plot in Figure 6.3.1.

Fig. 6.3.1 Solution to Poisson equation
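The scheme (6.3.7) can be tried out on a simple region. The following is a minimal sketch, assuming the unit square as the region and Gauss-Seidel sweeps as the linear solver (neither choice comes from the text; the function names are hypothetical). It rearranges (6.3.7) as $4u_{i,j} = (\text{sum of neighbors}) - h^2 f(x_i,y_j)$.

```python
import math


def solve_poisson_square(n, f, g, sweeps):
    """Approximate the solution of (6.3.7) on the unit square, h = 1/n.

    Assumption: Gauss-Seidel sweeps stand in for a proper sparse solver.
    Returns a grid u with u[i][j] approximating u(i*h, j*h).
    """
    h = 1.0 / n
    u = [[0.0] * (n + 1) for _ in range(n + 1)]
    # Dirichlet data: boundary nodes take the value of g.
    for i in range(n + 1):
        for j in range(n + 1):
            if i in (0, n) or j in (0, n):
                u[i][j] = g(i * h, j * h)
    for _ in range(sweeps):
        for i in range(1, n):
            for j in range(1, n):
                nbrs = u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1]
                # (nbrs - 4 u_ij) / h^2 = f_ij, solved for u_ij.
                u[i][j] = 0.25 * (nbrs - h * h * f(i * h, j * h))
    return u


def max_error(u, exact, n):
    """Largest nodal difference between the grid u and a known solution."""
    h = 1.0 / n
    return max(abs(u[i][j] - exact(i * h, j * h))
               for i in range(n + 1) for j in range(n + 1))
```

With $u(x,y) = \exp(x-y)$, for which $f = 2\exp(x-y)$, halving $h$ should reduce the maximum nodal error by roughly a factor of four once the sweeps have converged, in line with the $O(h^2)$ consistency error.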


6.3.1.1 Convergence Proof for Poisson Equation 


Convergence can be proven for these finite difference methods as $h\downarrow0$. Note that if $u$ is the vector in $\mathbb{R}^{\Omega_h}$ given by $(u)_{(i,j)} = u(x_i,y_j)$ for the exact solution, then

$A_h u_h = f_h - k_h$,
$A_h u = f_h - k_h + \eta_h$,  where
$(\eta_h)_{(i,j)} = (A_h u + k_h)_{(i,j)} - \left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right)(x_i,y_j) = O(h^2)$.

The hidden constant in $O(h^2)$ depends on the fourth derivatives of $u$. Subtracting the equations for $u$ and $u_h$ gives

$A_h(u - u_h) = \eta_h$.

We can see from the above calculations that $\|\eta_h\|_\infty = O(h^2)$ provided $u$ has continuous fourth derivatives. Our task then is to show that $\|A_h^{-1}\|_\infty$ is bounded, independently of the grid spacing $h$ as $h\downarrow0$.

First, we show that the entries of $-A_h^{-1}$ are non-negative. We can write $A_h = h^{-2}(-4I + E_h)$ where

$(E_h)_{(i,j),(k,\ell)} = \begin{cases} 1, & \text{if } (i,j) = (k\pm1,\ell) \text{ or } (i,j) = (k,\ell\pm1),\\ 0, & \text{otherwise,}\end{cases}$

for $(i,j),(k,\ell)\in\Omega_h$. If we consider solving the equation $-A_h w = b$ with $b\ge0$ in the sense that $b_{(i,j)}\ge0$ for all $(i,j)\in\Omega_h$, we can express this in the form $4w = h^2 b + E_h w$. This can be solved by the iterative method $w^{(k+1)} = (h/2)^2 b + \frac14 E_h w^{(k)}$. This is a convergent iteration provided $\rho(E_h) < 4$. Since every entry of $E_h$ is zero or one, and there are at most four non-zero entries per row, $\rho(E_h)\le\|E_h\|_\infty\le4$ by Theorem 2.16. We now show that $\rho(E_h)\ne4$. Since $E_h$ is a real symmetric matrix, it has real eigenvalues. So we just need to show that neither $+4$ nor $-4$ is an eigenvalue of $E_h$. Suppose $E_h\xi = \pm4\xi$, $\xi\ne0$. Choose $(i^*,j^*)\in\Omega_h$ where $|\xi_{(i^*,j^*)}| = \max_{(i,j)\in\Omega_h}|\xi_{(i,j)}|$. Then, taking $\xi_{(k,\ell)} = 0$ if $(k,\ell)\notin\Omega_h$, we have

$\pm4\,\xi_{(i^*,j^*)} = \xi_{(i^*+1,j^*)} + \xi_{(i^*-1,j^*)} + \xi_{(i^*,j^*+1)} + \xi_{(i^*,j^*-1)}$,  so
$4\,|\xi_{(i^*,j^*)}| \le |\xi_{(i^*+1,j^*)}| + |\xi_{(i^*-1,j^*)}| + |\xi_{(i^*,j^*+1)}| + |\xi_{(i^*,j^*-1)}| \le 4\,|\xi_{(i^*,j^*)}|$.

This can only occur if $|\xi_{(i^*,j^*)}| = |\xi_{(i^*+1,j^*)}| = |\xi_{(i^*-1,j^*)}| = |\xi_{(i^*,j^*+1)}| = |\xi_{(i^*,j^*-1)}|$. Applying the above argument to $(i^*+1,j^*)$, $(i^*-1,j^*)$, $(i^*,j^*+1)$, and $(i^*,j^*-1)$, we can see that as long as $\Omega_h$ is connected we have $|\xi_{(k,\ell)}| = |\xi_{(i^*,j^*)}|$ for all $(k,\ell)\in\Omega_h$. Eventually we reach a point of $\Omega_h$ for which some neighbor $(i\pm1,j)$ or $(i,j\pm1)$ is not in $\Omega_h$; the value of $\xi$ at that neighbor is zero, and so $\xi_{(i^*,j^*)} = 0$. This means that $\xi = 0$ and $\xi$ would not then be an eigenvector of $E_h$. Thus $\rho(E_h) < 4$. Then by Theorem 2.15, the iteration $w^{(k+1)} = (h/2)^2 b + \frac14 E_h w^{(k)}$ converges to a solution of $-A_h w = b$:

$w = \left[I + \tfrac14E_h + \left(\tfrac14E_h\right)^2 + \left(\tfrac14E_h\right)^3 + \cdots\right]\left(\tfrac{h}{2}\right)^2 b = -A_h^{-1}b$.

Since each term of the infinite sum for $-A_h^{-1}$ is a matrix with non-negative entries, $-A_h^{-1}$ is a matrix with non-negative entries.
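The iteration used in this argument is easy to run. The sketch below is illustrative only: the L-shaped index set, the helper names, and the choice of a right-hand side with a single non-zero entry are assumptions made here, not taken from the text. It applies $w^{(k+1)} = (h/2)^2 b + \frac14E_hw^{(k)}$ starting from $w^{(0)} = 0$, treating values outside the discrete domain as zero.

```python
def neighbours(i, j):
    """The four grid neighbours that E_h couples to (i, j)."""
    return ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1))


def jacobi_iteration(domain, b, h, iters):
    """Iterate w <- (h/2)^2 b + (1/4) E_h w from w = 0.

    `domain` is a set of index pairs; entries outside it count as zero.
    """
    w = {p: 0.0 for p in domain}
    for _ in range(iters):
        w = {p: (h / 2) ** 2 * b[p]
                + 0.25 * sum(w.get(q, 0.0) for q in neighbours(*p))
             for p in domain}
    return w


def residual_norm(domain, w, b, h):
    """Max-norm of b - (-A_h w), using -A_h = h^{-2} (4 I - E_h)."""
    return max(abs(b[p] - (4.0 * w[p]
                           - sum(w.get(q, 0.0) for q in neighbours(*p))) / h ** 2)
               for p in domain)


def l_shaped_domain(n):
    """Indices 1..n-1 in each direction with the upper-right quadrant removed."""
    half = n // 2
    return {(i, j) for i in range(1, n) for j in range(1, n)
            if not (i >= half and j >= half)}
```

Taking $b$ to be a unit vector makes $w$ approximate one column of $-A_h^{-1}$; every entry of the computed $w$ should come out non-negative, as the series argument predicts.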
We want to obtain a bound on $\|A_h^{-1}\|_\infty$ based on the diameter of $\Omega$ and the grid spacing $h$. To do this, we compare $\|A_h^{-1}\|_\infty$ for $\Omega$ with $\|(A_h')^{-1}\|_\infty$ for a larger region $\Omega'$ where this can be computed explicitly.

Consider expanding the discrete domain $\Omega_h\subset\Omega_h'$: If $A_h'$ is the matrix for $\Omega_h'$ and $A_h$ the matrix for $\Omega_h$, then

$A_h' = \frac{1}{h^2}(-4I + E_h')$  and  $A_h = \frac{1}{h^2}(-4I + E_h)$  where

$E_h' = \begin{bmatrix} E_h & (E_h')_{12}\\ (E_h')_{21} & (E_h')_{22}\end{bmatrix}$

if we order the rows and columns for $(i,j)\in\Omega_h$ before the rows and columns for $(i,j)\in\Omega_h'\setminus\Omega_h = \{\,(i,j) \mid (i,j)\in\Omega_h' \text{ but } (i,j)\notin\Omega_h\,\}$. Since all these $E$ matrices have non-negative entries,

$E_h' \ge \begin{bmatrix} E_h & 0\\ 0 & 0\end{bmatrix}$  entry-wise.

This means that

$-(A_h')^{-1} = \frac{h^2}{4}\left[I + \tfrac14E_h' + \left(\tfrac14E_h'\right)^2 + \left(\tfrac14E_h'\right)^3 + \cdots\right] \ge \frac{h^2}{4}\begin{bmatrix} I + \tfrac14E_h + \left(\tfrac14E_h\right)^2 + \cdots & 0\\ 0 & 0\end{bmatrix} = \begin{bmatrix} -A_h^{-1} & 0\\ 0 & 0\end{bmatrix}$  entry-wise.

So $\|(A_h')^{-1}\|_\infty \ge \|A_h^{-1}\|_\infty$. We can now look for a suitable expansion $\Omega_h'$ of $\Omega_h$ for which we can compute a bound of $\|(A_h')^{-1}\|_\infty$ that is independent of $h > 0$.

We can take $\Omega_h'$ to be essentially a circle: $\Omega_h' = \{\,(i,j) \mid i^2 + j^2 \le N^2\,\}$ and set $R = h(N+1)$. Since $-(A_h')^{-1}$ has only non-negative entries, $\|(A_h')^{-1}\|_\infty = \|(A_h')^{-1}e\|_\infty$ where $e$ is the vector of ones of the appropriate size. The vector $(A_h')^{-1}e$ is the solution of the discrete Poisson equations (6.3.7) for the region $\Omega'$ with $f(x,y) = 1$ for all $(x,y)$. In fact, if $b\ge e$ entry-wise, then $\|(A_h')^{-1}\|_\infty \le \|(A_h')^{-1}b\|_\infty$. The finite difference approximation is exact for quadratic functions. Let $w(x,y) = R^2 - x^2 - y^2$ and $w_{(i,j)} = w(x_i,y_j)$. Then provided $(i,j)$, $(i\pm1,j)$, and $(i,j\pm1)$ all belong to $\Omega_h'$,

$-\frac{w_{(i+1,j)} + w_{(i,j+1)} - 4w_{(i,j)} + w_{(i-1,j)} + w_{(i,j-1)}}{h^2} = -\frac{\partial^2 w}{\partial x^2}(x_i,y_j) - \frac{\partial^2 w}{\partial y^2}(x_i,y_j) = 4$.

If $(i,j)\in\Omega_h'$ but some neighbor $(i\pm1,j)$ or $(i,j\pm1)$ is not, then because $i^2 + j^2\le N^2$, $(i\pm1)^2 + j^2 \le N^2 + 2|i| + 1 \le (N+1)^2$, it follows that $w_{(i\pm1,j)}\ge0$. Similarly we can show that $w_{(i,j\pm1)}\ge0$. In such a case $(-A_h'w)_{(i,j)}\ge4$. So $-A_h'w\ge4e$ entry-wise. Setting $b = (-A_h'w)/4$, we see that $b\ge e$ entry-wise. Then

$\|(A_h')^{-1}\|_\infty \le \|(A_h')^{-1}b\|_\infty = \|{-w}/4\|_\infty \le R^2/4$.


This bound is clearly independent of $h$, and so

$\|u - u_h\|_\infty \le \|A_h^{-1}\|_\infty\,\|\eta_h\|_\infty \le \frac{R^2}{4}\cdot\frac{1}{12}\left(\left\|\frac{\partial^4 u}{\partial x^4}\right\|_\infty + \left\|\frac{\partial^4 u}{\partial y^4}\right\|_\infty\right)h^2$.

Not only have we shown convergence, but we have proved that the rate of convergence matches the consistency order of the method, at least provided the exact solution is smooth. Smoothness of the exact solution is not always the case for interior corners that "cut into" the region, such as the origin in Figure 6.3.1.

There are many variants of the Poisson equation, and the methods of showing convergence can be modified to deal with these situations. The hardest part of these proofs is showing that the numerical method is stable in a suitable sense. For the Poisson equation, it comes down to showing that $\|A_h^{-1}\|_\infty$ is bounded independent of $h > 0$. The method of proof relies on the fact that $-A_h^{-1}$ has only non-negative entries, which can be traced back to the corresponding property of the Poisson equation: if $-(\partial^2u/\partial x^2 + \partial^2u/\partial y^2)\ge0$ in $\Omega$ with $u(x,y) = 0$ on $\partial\Omega$, then $u(x,y)\ge0$ for all $(x,y)\in\Omega$.
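The stability bound $\|(A_h')^{-1}\|_\infty \le R^2/4$ for the disk-shaped index set can be checked numerically. The following sketch makes some assumptions of its own: the grid is centered at the origin ($x_i = ih$, $y_j = jh$), Gauss-Seidel sweeps are used to solve $-A_h'w = e$, and the function names are hypothetical. Since $-(A_h')^{-1}$ has non-negative entries, the largest entry of $w$ equals $\|(A_h')^{-1}\|_\infty$.

```python
def disk_domain(N):
    """Index set {(i, j) : i^2 + j^2 <= N^2}."""
    return {(i, j) for i in range(-N, N + 1) for j in range(-N, N + 1)
            if i * i + j * j <= N * N}


def solve_ones(domain, h, sweeps):
    """Gauss-Seidel for -A_h w = e, with w = 0 outside `domain`.

    Each update solves (4 w_ij - sum of neighbours) / h^2 = 1 for w_ij.
    """
    order = sorted(domain)
    w = {p: 0.0 for p in domain}
    for _ in range(sweeps):
        for (i, j) in order:
            nbrs = (w.get((i + 1, j), 0.0) + w.get((i - 1, j), 0.0)
                    + w.get((i, j + 1), 0.0) + w.get((i, j - 1), 0.0))
            w[(i, j)] = 0.25 * (h * h + nbrs)
    return w


def inverse_norm_and_bound(N, h, sweeps=1500):
    """Return (approximate ||(A_h')^{-1}||_inf, R^2 / 4) with R = h (N + 1)."""
    w = solve_ones(disk_domain(N), h, sweeps)
    R = h * (N + 1)
    return max(w.values()), R * R / 4.0
```

For example, `inverse_norm_and_bound(6, 0.1)` compares the computed norm with $R^2/4$ for $R = 0.7$; the computed value should lie below the bound and not far from the continuous value $r^2/4$ for a disk of radius $r$ between $Nh$ and $R$.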


6.3.1.2 Conservation Principles 


Many partial differential equations come from physical problems or can be under- 
stood as modeling some physical process. Typically, some quantities are conserved, 
like energy, mass, and momentum. Conservation of a quantity does not necessarily 
mean that the total amount of that quantity remains fixed. But, it does mean that we 
can identify net production (or consumption) of that quantity as a function of time 
and space. For the Poisson equation in two dimensions, and its variants, consider the 
diffusion equation 


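(6.3.8)  $\frac{\partial u}{\partial t} = D\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) + f(x,y)$  in $\Omega$,
(6.3.9)  $u(x,y) = g(x,y)$  for $(x,y)\in\partial\Omega$.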


Here u(t, x, y) represents the concentration of a certain chemical species, for exam- 
ple. Here D is the diffusion constant for the chemical species in whatever medium 
it is diffusing in. The function f(x, y) represents the net rate of production of this 
chemical species per unit area per unit time. 

Re-writing the diffusion equation (6.3.8) for constant $D$ as

(6.3.10)  $\frac{\partial u}{\partial t} = \operatorname{div}(D\,\nabla u) + f(x,y)$,

we can identify $D\,\nabla u$ as the rate of flow, or flux vector, for $u$. Note that (6.3.10) is in divergence form. The rate of flow is $n^TD\,\nabla u$ per unit time per unit length or area of the boundary, where $n$ is the unit vector perpendicular to the boundary. If we divide the region $\Omega$ into cells as shown in Figure 6.3.2, then the net rate of outflow from the central cell is

$D\,\frac{u_{i,j} - u_{i-1,j}}{h}\,h + D\,\frac{u_{i,j} - u_{i+1,j}}{h}\,h + D\,\frac{u_{i,j} - u_{i,j-1}}{h}\,h + D\,\frac{u_{i,j} - u_{i,j+1}}{h}\,h$.

Fig. 6.3.2 Finite volume cells

This can be equated to $-(d/dt)(h^2u_{i,j}) + h^2f(x_i,y_j)$, as the amount of $u$ in the central cell is approximately $h^2u_{i,j}$. This gives the equation

$\frac{d}{dt}(h^2u_{i,j}) = D\left[u_{i+1,j} + u_{i,j+1} - 4u_{i,j} + u_{i-1,j} + u_{i,j-1}\right] + h^2f(x_i,y_j)$.

For the steady state, we set $(d/dt)(h^2u_{i,j}) = 0$. Then

$D\,\frac{u_{i+1,j} + u_{i,j+1} - 4u_{i,j} + u_{i-1,j} + u_{i,j-1}}{h^2} = -f(x_i,y_j)$,

which is the discretization obtained by the second-order difference formula. The advantage of this formulation is in dealing with non-uniform grids, and varying diffusion coefficient $D(x,y)$. We still need the partial differential equation to be in divergence form.

As an example, suppose we keep uniform spacing of the grid points, but we have a varying diffusion coefficient $D(x,y)$. Set $D_{i+1/2,j} = D(\frac12(x_i + x_{i+1}), y_j)$ and $D_{i,j+1/2} = D(x_i, \frac12(y_j + y_{j+1}))$. Then the rate of net outflow from the central cell in Figure 6.3.2 is best approximated by

$D_{i+1/2,j}\,\frac{u_{i,j} - u_{i+1,j}}{h}\,h + D_{i-1/2,j}\,\frac{u_{i,j} - u_{i-1,j}}{h}\,h + D_{i,j+1/2}\,\frac{u_{i,j} - u_{i,j+1}}{h}\,h + D_{i,j-1/2}\,\frac{u_{i,j} - u_{i,j-1}}{h}\,h$.

Using this approximation we obtain the discretized equations

$\frac{1}{h^2}\left[(D_{i+1/2,j} + D_{i-1/2,j} + D_{i,j+1/2} + D_{i,j-1/2})\,u_{i,j} - D_{i+1/2,j}u_{i+1,j} - D_{i,j+1/2}u_{i,j+1} - D_{i-1/2,j}u_{i-1,j} - D_{i,j-1/2}u_{i,j-1}\right] = f(x_i,y_j)$.

If $D(x,y) > 0$ and continuous, this system has a unique solution $u_{i,j}$ if we set $u_{i,j} = g(x_i,y_j)$ for $(x_i,y_j)\notin\Omega$. This can be proven using the techniques of the previous section.
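The left-hand side of the variable-coefficient scheme can be written down directly from the face fluxes. This is a small sketch under stated assumptions: a rectangular array of nodal values with the grid origin at $(x_0, y_0)$, and a hypothetical function name. It evaluates the bracketed expression divided by $h^2$ at each interior node.

```python
def flux_form_operator(u, D, h, x0=0.0, y0=0.0):
    """Evaluate the left-hand side of the variable-D scheme at interior nodes.

    u[i][j] holds the nodal value at (x0 + i*h, y0 + j*h); D(x, y) is the
    diffusion coefficient, sampled at the midpoints of the four cell faces.
    Returns a dict mapping (i, j) to the net outflow divided by h^2.
    """
    out = {}
    for i in range(1, len(u) - 1):
        for j in range(1, len(u[0]) - 1):
            x, y = x0 + i * h, y0 + j * h
            d_east, d_west = D(x + h / 2, y), D(x - h / 2, y)
            d_north, d_south = D(x, y + h / 2), D(x, y - h / 2)
            out[(i, j)] = ((d_east + d_west + d_north + d_south) * u[i][j]
                           - d_east * u[i + 1][j] - d_north * u[i][j + 1]
                           - d_west * u[i - 1][j] - d_south * u[i][j - 1]) / h ** 2
    return out
```

As a sanity check, for $u(x,y) = x$ and $D(x,y) = 1 + x$ we have $-\operatorname{div}(D\nabla u) = -1$, and the discrete operator reproduces this value at every interior node up to rounding, because $D$ is linear and the face values are sampled at midpoints. With constant $D$ the operator reduces to $-D$ times the five-point formula.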


6.3.1.3 Computational Issues 


The system of equations for the Poisson equation is a linear system. It is a large, 
sparse system of equations, with no more than five non-zeros per row in two spatial 
dimensions. The system can be solved by either direct or iterative methods. 

Since $-A_h$ is symmetric positive definite, sparse Cholesky factorization can be used to solve the system. In this case, ordering the rows and columns to reduce the amount of fill-in can be important to reduce both the amount of memory needed, and the time taken to form the factorization and to solve the system. One of the best approaches is to use nested dissection (see Section 2.3.3.2). Using nested dissection requires $O(N^3)$ floating point operations and $O(N^2\log N)$ memory for an $N\times N$ grid.

Alternatively, conjugate gradients (see Section 2.4.2) provide an iterative method that avoids the cost of the memory needed to store the fill-in. The condition number of $-A_h$ is $\kappa_2(-A_h) = O(h^{-2}) = O(N^2)$, so obtaining an error of $\epsilon$ requires $O(\sqrt{N^2}\,\log(1/\epsilon)) = O(N\log(1/\epsilon))$ iterations. To achieve an error in the equation solver of $O(h^2) = O(N^{-2})$ (to be comparable to the error for the exact solution of the discretized equations) thus requires $O(N\log N)$ iterations. Each iteration takes $O(N^2)$ floating point operations as there are $N^2$ rows in $A_h$ and each row has no more than five entries. Thus conjugate gradients without preconditioning takes $O(N^3\log N)$ floating point operations. Preconditioning can greatly reduce the number of iterations needed for conjugate gradients. In this regard, multigrid methods [33, 248] arguably give near optimal performance with $\kappa_2(B_hA_h) = O(1)$ where $B_h$ is a multigrid preconditioner, bringing the total cost down to $O(N^2\log N)$ floating point operations, although the hidden constant is also fairly large.

Three-dimensional problems give even more advantage to iterative methods over 
direct methods, even without preconditioning. 
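To make the cost discussion concrete, here is a minimal matrix-free sketch of unpreconditioned conjugate gradients applied to $-A_hw = b$ on an $n\times n$ grid of interior points of a square. The square region, the stopping rule on the relative residual, and all function names are assumptions of this illustration; see Section 2.4.2 for the method itself.

```python
import math


def apply_neg_laplacian(w, n, h):
    """Return -A_h w on an n x n interior grid with zero boundary values."""
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 4.0 * w[i][j]
            if i > 0:
                s -= w[i - 1][j]
            if i < n - 1:
                s -= w[i + 1][j]
            if j > 0:
                s -= w[i][j - 1]
            if j < n - 1:
                s -= w[i][j + 1]
            out[i][j] = s / (h * h)
    return out


def dot(a, b):
    """Inner product of two grid functions stored as nested lists."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def conjugate_gradients(b, n, h, tol=1e-8, max_iter=10000):
    """Solve -A_h x = b; returns (x, number of iterations used)."""
    x = [[0.0] * n for _ in range(n)]
    r = [row[:] for row in b]          # residual b - (-A_h) x with x = 0
    p = [row[:] for row in r]          # first search direction
    rr = dot(r, r)
    b_norm = math.sqrt(rr)
    for k in range(1, max_iter + 1):
        Ap = apply_neg_laplacian(p, n, h)
        alpha = rr / dot(p, Ap)
        for i in range(n):
            for j in range(n):
                x[i][j] += alpha * p[i][j]
                r[i][j] -= alpha * Ap[i][j]
        rr_new = dot(r, r)
        if math.sqrt(rr_new) <= tol * b_norm:
            return x, k
        beta = rr_new / rr
        for i in range(n):
            for j in range(n):
                p[i][j] = r[i][j] + beta * p[i][j]
        rr = rr_new
    return x, max_iter
```

Each iteration costs one application of the five-point operator plus a few vector operations, which is the $O(N^2)$ work per iteration quoted above. Running the routine for several values of $n$ and recording the returned iteration count gives a rough picture of the growth in the number of iterations.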


6.3.2. Galerkin Method 


There is another approach to solving partial differential equations that is much more flexible, based on triangulations instead of rectangular grids. The mathematical foundation is the Galerkin method, but this method is also known as the finite element method. The method is named after Boris Galerkin for his work on rods and plates in 1915, although the mathematical underpinning was already developed by Walther Ritz in 1902. The history of the development of these methods is outlined in [99].

The method is based on reformulating the partial differential equation in its weak form. We take the Poisson equation as our example:

$\operatorname{div}\nabla u(x) = f(x)$  for $x\in\Omega$,
$u(x) = g(x)$  for $x\in\partial\Omega$.

Then for any smooth function $v(x)$ where $v(x) = 0$ for all $x\in\partial\Omega$,

$\int_\Omega (f(x) - \operatorname{div}\nabla u(x))\,v(x)\,dx = 0$.

Now

$\int_\Omega (\operatorname{div}\nabla u(x))\,v(x)\,dx = \int_\Omega\left\{\operatorname{div}(v(x)\nabla u(x)) - \nabla v(x)^T\nabla u(x)\right\}dx$
(using $\operatorname{div}(\phi\,w) = \nabla\phi^Tw + \phi\operatorname{div}w$ for smooth $\phi$ and $w$)
$= \int_{\partial\Omega} v(x)\,\nabla u(x)\cdot n(x)\,dS(x) - \int_\Omega\nabla v(x)^T\nabla u(x)\,dx$
$= \int_{\partial\Omega} v(x)\,\frac{\partial u}{\partial n}(x)\,dS(x) - \int_\Omega\nabla v(x)^T\nabla u(x)\,dx$

where $n(x)$ is the unit outward pointing normal vector at $x$ to the boundary $\partial\Omega$. By "outward pointing" we mean that $x + \epsilon\,n(x)\notin\Omega$ for any sufficiently small $\epsilon > 0$. The "normal derivative" $\partial u/\partial n(x)$ is $n(x)^T\nabla u(x)$, the derivative of $u(x)$ in the direction $n(x)$.

Since $v(x) = 0$ for $x\in\partial\Omega$, $\int_{\partial\Omega} v(x)\,(\partial u/\partial n)(x)\,dS(x) = 0$. Then

$\int_\Omega (\operatorname{div}\nabla u(x))\,v(x)\,dx = -\int_\Omega\nabla v(x)^T\nabla u(x)\,dx$,  so
$\int_\Omega (f(x) - \operatorname{div}\nabla u(x))\,v(x)\,dx = \int_\Omega\left[f(x)\,v(x) + \nabla v(x)^T\nabla u(x)\right]dx = 0$.

The formulation

(6.3.11)  $\int_\Omega f(x)\,v(x)\,dx = -\int_\Omega\nabla v(x)^T\nabla u(x)\,dx$  for all $v(\cdot)$

where $v(x) = 0$ on $\partial\Omega$, is called the weak form of the Poisson equation.
The Galerkin method uses finite-dimensional spaces of functions $V_h$ depending on some parameter $h > 0$ where the integrals $\int_\Omega f(x)\,v(x)\,dx$ and $\int_\Omega\nabla v(x)^T\nabla u(x)\,dx$ are defined for every $u, v\in V_h$. These spaces $V_h$ must be subspaces of the Sobolev space $H^1(\Omega)$:

(6.3.12)  $u\in H^1(\Omega)$ if and only if $\int_\Omega\left[u^2 + \nabla u^T\nabla u\right]dx$ is finite.

The Galerkin method is to find $u_h\in V_h$ where $u_h(x) = g(x)$ for $x\in\partial\Omega$ and

(6.3.13)  $\int_\Omega f(x)\,v_h(x)\,dx = -\int_\Omega\nabla v_h(x)^T\nabla u_h(x)\,dx$  for all $v_h\in V_h$
(6.3.14)  where $v_h(x) = 0$ for $x\in\partial\Omega$, and
(6.3.15)  $u_h(x) = g(x)$  for $x\in\partial\Omega$.

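Since $V_h$ is finite dimensional, there is a basis $\{\phi_1,\phi_2,\ldots,\phi_N\}$ for $V_h$. As $u_h\in V_h$, we can write $u_h = \sum_{j=1}^N u_j\phi_j$. We assume that for $i = 1,2,\ldots,M$ we have $\phi_i(x) = 0$ for $x\in\partial\Omega$. If $g$ is equal to a function $g_h\in V_h$ on $\partial\Omega$, then we can write $g_h = \sum_{j=M+1}^N g_j\phi_j$. So $u_h = \sum_{j=1}^M u_j\phi_j + \sum_{j=M+1}^N g_j\phi_j$, with $g_j$ given. The Galerkin equations (6.3.13), taking $v_h = \phi_i$ for $i = 1,2,\ldots,M$, are

$\int_\Omega f(x)\,\phi_i(x)\,dx = -\int_\Omega\nabla\phi_i(x)^T\left[\sum_{j=1}^M u_j\nabla\phi_j(x) + \sum_{j=M+1}^N g_j\nabla\phi_j(x)\right]dx$
$= -\sum_{j=1}^M u_j\int_\Omega\nabla\phi_i(x)^T\nabla\phi_j(x)\,dx - \sum_{j=M+1}^N g_j\int_\Omega\nabla\phi_i(x)^T\nabla\phi_j(x)\,dx$.

If we write $a_{ij} = -\int_\Omega\nabla\phi_i(x)^T\nabla\phi_j(x)\,dx$, $b_i = \int_\Omega f(x)\,\phi_i(x)\,dx$ and $A_h = [a_{ij} \mid i,j = 1,2,\ldots,M]$, $\tilde A_h = [a_{ij} \mid i = 1,2,\ldots,M,\ j = M+1,\ldots,N]$ then

$b_h = A_hu_h + \tilde A_hg_h$

where $A_h$ is an $M\times M$ matrix. We will see soon that under standard conditions, $-A_h$ is positive definite, and so is invertible. We can then solve the Galerkin equations for the unknown coefficients $u_j$, $j = 1,2,\ldots,M$:

$u_h = A_h^{-1}\left[b_h - \tilde A_hg_h\right]$.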


We can create these subspaces $V_h$ in different ways, although the most common is to use a triangulation $\mathcal{T}_h$, and $V_h$ is the space of piecewise polynomial functions over $\mathcal{T}_h$ of a specified degree that are continuous across the triangulation. See Section 4.3.2 for more detailed discussion of triangulations.
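The Galerkin equations are easiest to see in one space dimension. The sketch below is an illustration, not the two-dimensional method of the text: it assumes the problem $u'' = f$ on $(0,1)$ with $u(0) = g_0$, $u(1) = g_1$, piecewise linear "hat" basis functions on a uniform mesh, and Simpson's rule on each element for the load integrals $b_i$ (the function name and the quadrature choice are assumptions made here). With the sign convention $a_{ij} = -\int\phi_i'\phi_j'\,dx$, the matrix $A_h$ is tridiagonal with entries $-2/h$ on the diagonal and $1/h$ next to it, and the boundary values enter through $\tilde A_hg_h$.

```python
def galerkin_1d(f, g0, g1, n):
    """Piecewise linear Galerkin solution of u'' = f on (0, 1), u(0)=g0, u(1)=g1.

    Uses n >= 2 equal elements; returns the nodal values u_0, ..., u_n.
    """
    h = 1.0 / n
    m = n - 1                      # number of interior unknowns
    diag, off = -2.0 / h, 1.0 / h  # a_ii and a_{i,i+1} = a_{i+1,i}
    # b_i = integral of f * phi_i, by Simpson's rule on the two elements
    # that make up the support of phi_i.
    rhs = [(h / 3.0) * (f(i * h - h / 2) + f(i * h) + f(i * h + h / 2))
           for i in range(1, n)]
    # Move the known boundary contributions to the right-hand side.
    rhs[0] -= off * g0
    rhs[-1] -= off * g1
    # Thomas algorithm for the tridiagonal system A_h u = rhs.
    c = [0.0] * m
    d = [0.0] * m
    c[0] = off / diag
    d[0] = rhs[0] / diag
    for i in range(1, m):
        denom = diag - off * c[i - 1]
        c[i] = off / denom
        d[i] = (rhs[i] - off * d[i - 1]) / denom
    u = [0.0] * m
    u[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        u[i] = d[i] - c[i] * u[i + 1]
    return [g0] + u + [g1]
```

For $f = 2$, $g_0 = 0$, $g_1 = 1$ the exact solution is $u(x) = x^2$, and the computed nodal values agree with it up to rounding, since the load integrals are then computed exactly.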

The boundary condition $u_h(x) = g(x)$ for $x\in\partial\Omega$ often cannot be enforced as specified if $g$ on the boundary $\partial\Omega$ is not equal to any function in $V_h$ on $\partial\Omega$. However, we can use $g_h(x)$ being a polynomial interpolant or some other approximation of $g(x)$ over the boundary $\partial\Omega_h$ of the triangulated region $\Omega_h := \bigcup_{K\in\mathcal{T}_h}K$.

With a triangulation $\mathcal{T}_h$ we can use various interpolation methods that are consistent across the triangles of the triangulation, such as piecewise linear interpolation. See Sections 4.3.1 and 4.3.2 for more examples and error estimates for interpolation over triangles and triangulations.

We need the integrals $\int_\Omega\nabla u_h^T\nabla v_h\,dx$ to be well-defined for each $u_h, v_h\in V_h$. From the theory of Lebesgue integration [220], $\int_\Omega g(x)\,dx$ is defined if and only if $\int_\Omega|g(x)|\,dx$ is finite. Now $|a^Tb|\le\|a\|_2\,\|b\|_2$ by the Cauchy–Schwarz inequality (A.1.4). Since $rs\le\frac12(r^2 + s^2)$ for any real $r$ and $s$ (coming from $0\le(r-s)^2$), we have $|a^Tb|\le\frac12(\|a\|_2^2 + \|b\|_2^2)$. Thus, as long as $\int_\Omega\|\nabla u_h\|_2^2\,dx$ and $\int_\Omega\|\nabla v_h\|_2^2\,dx$ are both finite, $\int_\Omega\nabla u_h^T\nabla v_h\,dx$ is well-defined. So we can apply this approach as long as $u_h, v_h$ belong to the Sobolev space $H^1(\Omega)$ defined in (6.3.12) with norm given by

(6.3.16)  $\|u\|_{H^1(\Omega)} = \left(\int_\Omega\left[u(x)^2 + \|\nabla u(x)\|_2^2\right]dx\right)^{1/2}$.

There is also an inner product that generates the norm:

(6.3.17)  $(u,v)_{H^1(\Omega)} = \int_\Omega\left[u(x)\,v(x) + \nabla u(x)^T\nabla v(x)\right]dx$,  so
$\|u\|_{H^1(\Omega)} = \sqrt{(u,u)_{H^1(\Omega)}}$.


Because the H!(Q) norm is generated by an inner product, H!(Q) is a Hilbert space. 

An issue we seem to have is that $\nabla u_h$ is not usually defined everywhere. At every boundary between triangles, if we use piecewise linear functions, for example, the functions match on the boundary but the gradients typically do not. We can, however, consider smooth functions that approximate the piecewise linear functions $u_h$. If we think about the situation in one variable we can smooth out the transition as illustrated in Figure 6.3.3. Note that the smoothed function $u_{h,\epsilon}$ does not have gradients larger than the maximum gradient of $u_h$, and the region on which $u_{h,\epsilon}$ and $u_h$ differ has a total area that goes to zero as $\epsilon\to0$. Then by the dominated convergence theorem [220, p. 26],

$\int_\Omega\|\nabla u_{h,\epsilon}(x)\|_2^2\,dx \to \int_\Omega\|\nabla u_h(x)\|_2^2\,dx$  as $\epsilon\to0$.

Note that this argument would fail if we were dealing with second derivatives. Differentiating a piecewise linear function results in jumps in the first derivative. Integrating a function with a jump is perfectly fine. But if we differentiate a function with jumps in the sense of distributions, we obtain Dirac $\delta$-functions, whose squares are not integrable. While functions with jumps can be smoothed out, if we try to use smooth approximations $u_\epsilon(x)$ to $u(x)$ with a jump we find that $\int u_\epsilon'(x)^2\,dx\to\infty$ as $\epsilon\downarrow0$.

Fig. 6.3.3 Smoothing of piecewise linear function

The matrix $A_h$ for the linear system for $u_h(x) = \sum_{j=1}^M u_j\phi_j(x)$ is given, up to the sign convention used above, by $a_{ij} = \int_\Omega\nabla\phi_i^T\nabla\phi_j\,dx$. This matrix is clearly symmetric as $a_{ij} = a_{ji}$ by symmetry of the inner product. What is more, the matrix is positive definite. To see this, we compute

$z^TA_hz = \sum_{i,j=1}^M z_iz_j\int_\Omega\nabla\phi_i^T\nabla\phi_j\,dx = \int_\Omega\left(\sum_{i=1}^M z_i\nabla\phi_i\right)^T\left(\sum_{j=1}^M z_j\nabla\phi_j\right)dx$.

Setting $z_h(x) = \sum_{j=1}^M z_j\phi_j(x)$, we see that $\nabla z_h(x) = \sum_{j=1}^M z_j\nabla\phi_j(x)$, and so

$z^TA_hz = \int_\Omega\nabla z_h(x)^T\nabla z_h(x)\,dx = \int_\Omega\|\nabla z_h(x)\|_2^2\,dx \ge 0$.

Furthermore, $z^TA_hz = 0$ can only happen if $\nabla z_h(x) = 0$ for all $x$ except those in a set of zero volume (or area). Since $z_h$ is piecewise linear and continuous, this implies that $z_h$ is constant. As $z_h(x) = 0$ for all $x\in\partial\Omega$, that constant must be zero. That is, $z = 0$, and thus, $A_h$ is positive definite. The system of equations is therefore solvable. However, to show that this solution method is reliable, several more issues must be dealt with.

First, we need to understand under what conditions the computed solution converges to the exact solution, $u_h\to u$ as $h\to0$, and in what sense. Second, the linear system is solvable in exact arithmetic, but we need to know how perturbations due to roundoff error, integration error, and other sources affect the numerical result. This will involve looking at the condition number of $A_h$. The third issue is how to generate and use triangulations in order to efficiently compute the desired coefficients $a_{ij}$ and $b_j$.

6.3.2.1 Existence and Uniqueness of Solutions

The framework we use for developing the convergence theory uses a space of functions $V$ from which we select a finite-dimensional subspace $V_h\subset V$ to which we apply the Galerkin method. For the Poisson equation, we take $V$ to be $H^1(\Omega)$ as defined in (6.3.12). We need a bilinear form $a\colon V\times V\to\mathbb{R}$ that is continuous using the appropriate norm $\|\cdot\|$ on $V$. For the Poisson equation,

(6.3.18)  $a(u,v) = \int_\Omega\nabla u^T\nabla v\,dx$.

We also need the linear form $b(v) = \int_\Omega f\,v\,dx$. The solution $u\in V$ then satisfies

(6.3.19)  $a(u,v) = b(v)$  for all $v\in V$ where
(6.3.20)  $v = 0$  on $\partial\Omega$, and
(6.3.21)  $u = g$  on $\partial\Omega$.


The operator taking $u\in H^1(\Omega)$ to its restriction to the boundary $\partial\Omega$ is known as the trace operator, and is often denoted $u\mapsto\gamma u$. In fact, $\gamma\colon H^1(\Omega)\to H^{1/2}(\partial\Omega)$ where $H^{1/2}(\partial\Omega)$ is a fractional order Sobolev space. Details of how exactly fractional order Sobolev spaces are defined, and why the trace operator has values in $H^{1/2}(\partial\Omega)$ can be found in [11, 31, 245], for example.

We assume that the boundary function $g$ can be extended to a function $\tilde g\in H^1(\Omega)$: $\gamma\tilde g = g$. In fact, the trace operator $\gamma\colon H^1(\Omega)\to H^{1/2}(\partial\Omega)$ is onto, and so there is an extension $\tilde g\in H^1(\Omega)$ where $\gamma\tilde g = g$ whenever $g\in H^{1/2}(\partial\Omega)$. But with this extension of $g$, $\tilde u := u - \tilde g\in H^1(\Omega)$ is zero on the boundary $\partial\Omega$. Then the solution $u$ satisfies

$a(u - \tilde g, v) = a(u,v) - a(\tilde g,v) = b(v) - a(\tilde g,v)$,  and
$u - \tilde g = 0$  on $\partial\Omega$.

Let $V_0 = \{\,v\in V \mid v = 0 \text{ on } \partial\Omega\,\}$. Then $\tilde u\in V_0$ and

(6.3.22)  $a(\tilde u,v) = b(v) - a(\tilde g,v)$  for all $v\in V_0$.

To solve an equation in the weak form $\tilde u\in V_0$ and

$a(\tilde u,v) = b(v)$  for all $v\in V_0$,

we need some conditions on the bilinear form $a(\cdot,\cdot)$. The main condition we use is the condition that $a(\cdot,\cdot)$ is elliptic: there is a constant $\alpha > 0$ where

(6.3.23)  $a(\tilde u,\tilde u)\ge\alpha\,\|\tilde u\|_V^2$  for all $\tilde u\in V_0$.

In the case of the Poisson equation, this means showing that for some $\alpha > 0$,

(6.3.24)  $a(\tilde u,\tilde u) = \int_\Omega\|\nabla\tilde u(x)\|_2^2\,dx \ge \alpha\int_\Omega\left[\|\nabla\tilde u(x)\|_2^2 + \tilde u(x)^2\right]dx$

for any smooth function $\tilde u$ on $\Omega$ where $\tilde u(x) = 0$ for all $x\in\partial\Omega$. The arguments needed to prove this involve functional analysis, specifically of Sobolev spaces [213, 245], and are also summarized in [12].

Once it is established that the bilinear form a(u, v) is elliptic, we want to use this 
to show existence and uniqueness of solutions to (6.3.22). This can be done via the 
Lax—Milgram Lemma: 


Lemma 6.13 (Lax–Milgram Lemma) If $a\colon V\times V\to\mathbb{R}$ is a continuous elliptic bilinear form where $V$ is a Hilbert space, and $b\colon V\to\mathbb{R}$ is a continuous linear function, then there is one and only one $u\in V$ where

$a(u,v) = b(v)$  for all $v\in V$.


The proof of this requires some knowledge of functional analysis, which can be 
found in textbooks such as [11, 158, 245]. We assume the Lax—Milgram Lemma 
holds in what follows. 

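Note that for a bilinear form $a\colon V\times V\to\mathbb{R}$, $a$ is continuous if and only if there is a constant $M$ where $|a(u,v)|\le M\,\|u\|_V\,\|v\|_V$ for all $u, v\in V$. Continuity of a linear functional $b\colon V\to\mathbb{R}$ is equivalent to there being a constant $B$ where $|b(v)|\le B\,\|v\|_V$ for all $v\in V$.

For the Poisson equation, we can show the existence of a solution of the weak form (6.3.18) by applying the Lax–Milgram Lemma (Lemma 6.13) to the restricted weak form (6.3.22) where $a(\tilde u,\tilde v) = \int_\Omega\nabla\tilde u^T\nabla\tilde v\,dx$, $b(\tilde v) = \int_\Omega\tilde v\,f\,dx$ and $V_0 = \{\,\tilde u\in H^1(\Omega) \mid \gamma\tilde u = 0\,\}$ where $\gamma\tilde u$ is the restriction of $\tilde u$ to the boundary $\partial\Omega$. Also, the linear form $b(\tilde u)$ is continuous provided $f^2$ is integrable, as $\tilde u\in H^1(\Omega)$ implies

$\left|\int_\Omega f\,\tilde u\,dx\right| \le \left[\int_\Omega f^2\,dx\right]^{1/2}\left[\int_\Omega\tilde u^2\,dx\right]^{1/2} \le \left[\int_\Omega f^2\,dx\right]^{1/2}\left[\int_\Omega\left(\tilde u^2 + \|\nabla\tilde u\|_2^2\right)dx\right]^{1/2}$.

The solution of the original problem is then $u = \tilde u + \tilde g$, so that $u(x) = \tilde g(x) = g(x)$ for $x\in\partial\Omega$.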


6.3.2.2 Convergence Theory 
The abstract Galerkin method for finding $u\in V$ where

$a(u,v) = b(v)$  for all $v\in V$,

is to pick a finite-dimensional space of approximations $V_h\subset V$ and find $u_h\in V_h$ where

(6.3.25)  $a(u_h,v_h) = b(v_h)$  for all $v_h\in V_h$.

The foundation of the convergence theory of finite-element methods is Céa's inequality:

Theorem 6.14 (Céa's inequality). If $a(u,v)$ is a continuous elliptic bilinear form on a Hilbert space $V$, then there is a constant $C$, depending only on $M$ and $\alpha$ for the elliptic form $a(\cdot,\cdot)$, where the true solution $u$ and the solution $u_h$ of (6.3.25) satisfy

$\|u - u_h\|_V \le C\min_{v_h\in V_h}\|u - v_h\|_V$.

That is, the error in the solution of the Galerkin method in the $V$ norm is within a constant factor of the approximation error in the $V$ norm by functions in $V_h$.

Proof Suppose that $a(u,u)\ge\alpha\,\|u\|_V^2$ for all $u\in V$ with $\alpha > 0$ since $a(\cdot,\cdot)$ is elliptic. Suppose also that $|a(u,v)|\le M\,\|u\|_V\,\|v\|_V$ since $a(\cdot,\cdot)$ is a continuous bilinear form. Then the Galerkin method (6.3.25) implies that $u_h\in V_h$ where

$a(u_h,v_h) = b(v_h)$  for all $v_h\in V_h$, while
$a(u,v) = b(v)$  for all $v\in V$.

Then

$\alpha\,\|u_h - u\|_V^2 \le a(u - u_h, u - u_h)$
$= a(u - u_h, u - v_h)$  for any $v_h\in V_h$ since
$a(u - u_h, u_h - v_h) = a(u, u_h - v_h) - a(u_h, u_h - v_h)$
$= b(u_h - v_h) - b(u_h - v_h) = 0$  as $u_h - v_h\in V_h\subset V$.

Thus, for any $v_h\in V_h$,

$\alpha\,\|u_h - u\|_V^2 \le a(u - u_h, u - v_h) \le M\,\|u - u_h\|_V\,\|v_h - u\|_V$.

Dividing by $\|u - u_h\|_V$ (assuming that this is positive) we get

$\alpha\,\|u_h - u\|_V \le M\,\|v_h - u\|_V$.

Setting $C = M/\alpha$, we obtain

$\|u_h - u\|_V \le C\,\|v_h - u\|_V$

for all $v_h\in V_h$. Since $V_h$ is finite-dimensional, the minimum over $v_h\in V_h$ exists, and we obtain Céa's inequality. □


Usually, in the finite element method, the spaces $V_h$ are spaces of piecewise polynomials that are interpolants over triangulations (see Section 4.3.2). So we use the approximation properties of $V_h$ in the $V$ norm. For the Poisson equation, $V = H^1(\Omega)$ or a suitable subspace of this. Because this $V = H^1(\Omega)$ norm involves first derivatives, it is necessary for the triangulations to be "well-shaped" in that the ratio of the diameter of the triangles to the narrowest width must be bounded. That is, the triangles must have a bounded "aspect ratio". See Section 4.3.3 for more details.


6.3.2.3 Conditioning of the Linear Systems 


The error bounds arising from Céa's inequality (Theorem 6.14) give the impression that the only issue is the ability to approximate a solution $u\in V$ by functions $u_h\in V_h$. From this point of view, the more functions in $V_h$ the better. However, the linear systems can become extremely ill-conditioned if the basis for $V_h$ is close to being linearly dependent. For example, using the basis $\phi_j(x) = x^{j-1}$ for $j = 1,2,\ldots,n$ on $\Omega = (0,1)$ will result in extremely ill-conditioned linear systems, with the condition number growing at least exponentially in $n$. Small errors in the formulation, which can include floating point roundoff, or in the numerical solver, can result in potentially large errors in the solution.

Fortunately, well-shaped triangulations can avoid this exponential growth in the 
condition number under some very mild conditions. To see why, we start with the 
mass matrix, which we can use to identify near linear dependence of basis functions. 


For the Poisson problem using a basis $\{\phi_1,\phi_2,\ldots,\phi_N\}$, the mass matrix $M_h$ is given by

(6.3.26)  $m_{jk} = \int_\Omega\phi_j(x)\,\phi_k(x)\,dx$,

while the linear system to solve for $u_h = \sum_{j=1}^N u_j\phi_j$ has the matrix $A_h$ given by

(6.3.27)  $a_{jk} = \int_\Omega\nabla\phi_j(x)^T\nabla\phi_k(x)\,dx$,

which is often called the stiffness matrix.

For a basis coming from an interpolation scheme on triangles, we look to the basis $\{\hat\phi_1,\hat\phi_2,\ldots,\hat\phi_{\hat N}\}$ for the reference triangle $\hat K$. As these are linearly independent, the mass matrix $\hat M$ for $\hat K$ by itself,

$\hat m_{rs} = \int_{\hat K}\hat\phi_r(\hat x)\,\hat\phi_s(\hat x)\,d\hat x$,

must be positive definite. The condition number $\kappa_2(\hat M)$ can be used as a measure of the quality of the basis on $\hat K$ (smaller is better), but with a fixed basis on $\hat K$, this is a known and finite quantity. For example, if $\hat K$ is the triangle with vertices $(0,0)$, $(1,0)$, and $(0,1)$, and the basis is linear, $\hat\phi_1(x,y) = x$, $\hat\phi_2(x,y) = y$, and $\hat\phi_3(x,y) = 1 - x - y$, then

$\hat M = \frac{1}{24}\begin{bmatrix} 2 & 1 & 1\\ 1 & 2 & 1\\ 1 & 1 & 2\end{bmatrix}$;  $\kappa_2(\hat M) = 4$.
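The reference mass matrix in this example can be reproduced exactly. The sketch below assumes the edge-midpoint quadrature rule on the triangle, which is exact for polynomials of degree two and hence for products of linear basis functions; the helper names are hypothetical. Exact rational arithmetic avoids any rounding.

```python
from fractions import Fraction as Fr


def reference_mass_matrix():
    """Mass matrix of the linear basis x, y, 1 - x - y on the unit triangle."""
    # Midpoints of the three edges of the triangle (0,0), (1,0), (0,1).
    mids = [(Fr(1, 2), Fr(0)), (Fr(1, 2), Fr(1, 2)), (Fr(0), Fr(1, 2))]
    area = Fr(1, 2)
    basis = [lambda x, y: x, lambda x, y: y, lambda x, y: 1 - x - y]
    # Quadrature: (area / 3) * sum of the integrand at the edge midpoints.
    return [[area / 3 * sum(p(x, y) * q(x, y) for x, y in mids)
             for q in basis] for p in basis]


def matvec(M, v):
    """Matrix-vector product with exact rational entries."""
    return [sum(M[r][s] * v[s] for s in range(len(v))) for r in range(len(M))]
```

Applying `matvec` to $(1,1,1)$ returns $\frac{1}{6}(1,1,1)$, and applying it to $(1,-1,0)$ or $(1,1,-2)$ returns $\frac{1}{24}$ times the vector, so the eigenvalues are $1/6$ and $1/24$ (twice) and their ratio is the condition number $4$ quoted above.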


We assume that the basis functions we use on a triangle $K$ in a triangulation are $\phi_j(x) = \hat\phi_r(\hat x)$ where $x = T_K(\hat x)$ with $T_K$ an affine transformation $\hat K\to K$. We write $T_K(\hat x) = A_K\hat x + b_K$. Note that $\|A_K\|_2 = O(h_K)$ where $h_K = \operatorname{diam}(K)$. We assume that the mesh is well-shaped so that $\kappa_2(A_K)\le\kappa_{\max}$ no matter the triangle $K$ in the triangulation. The mapping from the index for the basis function on $\hat K$ to the index of the basis function on $K$ in the triangulation is $r\mapsto j = j(K,r)$.

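To determine the condition number of the matrix $M_h$ we look to the quadratic form

$u^TM_hu = \sum_{j,k=1}^N m_{jk}u_ju_k = \int_\Omega\left(\sum_{j=1}^N u_j\phi_j(x)\right)\left(\sum_{k=1}^N u_k\phi_k(x)\right)dx = \sum_{K\in\mathcal{T}_h}\int_K\left(\sum_{j=1}^N u_j\phi_j(x)\right)\left(\sum_{k=1}^N u_k\phi_k(x)\right)dx$

where $\mathcal{T}_h$ is the triangulation of $\Omega$. Writing this in terms of the basis functions $\hat\phi_r$ we have

$u^TM_hu = \sum_{K\in\mathcal{T}_h}\int_K\left(\sum_{r=1}^{\hat N}u_{j(K,r)}\hat\phi_r(\hat x)\right)\left(\sum_{s=1}^{\hat N}u_{j(K,s)}\hat\phi_s(\hat x)\right)dx$  where $x = T_K(\hat x)$,
$= \sum_{K\in\mathcal{T}_h}\sum_{r,s=1}^{\hat N}u_{j(K,r)}u_{j(K,s)}\,|\det A_K|\int_{\hat K}\hat\phi_r(\hat x)\,\hat\phi_s(\hat x)\,d\hat x$
$= \sum_{K\in\mathcal{T}_h}|\det A_K|\sum_{r,s=1}^{\hat N}u_{j(K,r)}u_{j(K,s)}\,\hat m_{rs}$.

If we write $u_K$ for the vector of entries $u_{j(K,r)}$ for $r = 1,2,\ldots,\hat N$, then

$u^TM_hu = \sum_{K\in\mathcal{T}_h}|\det A_K|\,u_K^T\hat Mu_K$.

We can bound $u_K^T\hat Mu_K$ below and above by $\lambda_{\min}(\hat M)\,\|u_K\|_2^2$ and $\lambda_{\max}(\hat M)\,\|u_K\|_2^2 = \lambda_{\min}(\hat M)\,\kappa_2(\hat M)\,\|u_K\|_2^2$, respectively, since $\hat M$ is a symmetric positive definite matrix. Also note that $|\det A_K| = \operatorname{area}(K)/\operatorname{area}(\hat K)$. If $d > 2$ we should use $\operatorname{vol}_d(K)$, the $d$-dimensional volume, instead of $\operatorname{area}(K)$. Other than changing from areas to $d$-dimensional volumes, the remainder of the argument here holds for $\Omega\subset\mathbb{R}^d$. The value of

$u_K^T\hat Mu_K = c_K\,\lambda_{\min}(\hat M)\,\|u_K\|_2^2$  where $1\le c_K\le\kappa_2(\hat M)$;  so

$u^TM_hu = \sum_{K\in\mathcal{T}_h}\frac{\operatorname{area}(K)}{\operatorname{area}(\hat K)}\,c_K\,\lambda_{\min}(\hat M)\,\|u_K\|_2^2$
$= \frac{\lambda_{\min}(\hat M)}{\operatorname{area}(\hat K)}\sum_{K\in\mathcal{T}_h}\operatorname{area}(K)\,c_K\,\|u_K\|_2^2$
$= \frac{\lambda_{\min}(\hat M)}{\operatorname{area}(\hat K)}\sum_{K\in\mathcal{T}_h}\operatorname{area}(K)\,c_K\sum_{r=1}^{\hat N}u_{j(K,r)}^2$.

We can turn this into a sum over $j$ by reversing the order of summation:

$u^TM_hu = \frac{\lambda_{\min}(\hat M)}{\operatorname{area}(\hat K)}\sum_{j=1}^N\left[\sum_{K\in\mathcal{K}(j)}\operatorname{area}(K)\,c_K\right]u_j^2$.

Here $\mathcal{K}(j)$ is the set of all triangles $K\in\mathcal{T}_h$ where $j = j(K,r)$ for some $r = 1,2,\ldots,\hat N$. This inner sum $\sum_{K\in\mathcal{K}(j)}\operatorname{area}(K)\,c_K$ is between $\sum_{K\in\mathcal{K}(j)}\operatorname{area}(K)$ and $\sum_{K\in\mathcal{K}(j)}\operatorname{area}(K)\,\kappa_2(\hat M)$. The lower bound is the sum of the areas of the triangles $K$ on which $\phi_j$ is not identically zero. We will call this area $a_j = \sum_{K\in\mathcal{K}(j)}\operatorname{area}(K) = \operatorname{area}(\operatorname{supp}\phi_j)$ where $\operatorname{supp}g = \{\,x \mid g(x)\ne0\,\}$. Then we have

$u^TM_hu = \frac{\lambda_{\min}(\hat M)}{\operatorname{area}(\hat K)}\sum_{j=1}^N\left[\frac{\sum_{K\in\mathcal{K}(j)}\operatorname{area}(K)\,c_K}{\sum_{K\in\mathcal{K}(j)}\operatorname{area}(K)}\right]a_ju_j^2$.

The quantity in square brackets is a weighted average of numbers between one and $\kappa_2(\hat M)$, so we can write

$u^TM_hu = \frac{\lambda_{\min}(\hat M)}{\operatorname{area}(\hat K)}\sum_{j=1}^N\hat c_j\,a_j\,u_j^2$,

where $1\le\hat c_j\le\kappa_2(\hat M)$. Then

(6.3.28)  $\kappa_2(M_h) \le \kappa_2(\hat M)\,\frac{\max_ja_j}{\min_ja_j} = \kappa_2(\hat M)\,\frac{\max_j\operatorname{area}(\operatorname{supp}\phi_j)}{\min_j\operatorname{area}(\operatorname{supp}\phi_j)}$.

This gives modest bounds on $\kappa_2(M_h)$ provided the areas of the supports $\operatorname{supp}\phi_j$ do not vary widely in size.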

There are cases where we do want the supports of $\phi_j$ to vary widely in size: there might be regions where the solution is less smooth, or changes rapidly. In these regions we want the triangles to be small, while in regions where the solution is more smooth, or changes slowly, we want to use larger triangles. If we take $D$ to be a diagonal matrix with diagonal entries $D_{jj} = a_j^{-1/2}$, we can pre-condition $M_h$ with $D$:

$w^TDM_hDw = \frac{\lambda_{\min}(\hat M)}{\operatorname{area}(\hat K)}\sum_{j=1}^N\hat c_j\,a_j\,(a_j^{-1/2}w_j)^2 = \frac{\lambda_{\min}(\hat M)}{\operatorname{area}(\hat K)}\sum_{j=1}^N\hat c_j\,w_j^2$

so $DM_hD$ has condition number bounded by $\kappa_2(\hat M)$. So even in this case, $M_h$ has a diagonal preconditioner that gives a condition number that is bounded independently of the triangulation.

We still want to get a bound on $\kappa_2(A_h)$, where for the Poisson problem

\[
a_{jk} = \int_\Omega \nabla \phi_j(x)^T \nabla \phi_k(x)\, dx .
\]

For conditioning of $A_h$, having a well-shaped triangulation is more important than for $M_h$. However, we will use the condition number of the mass matrix $M_h$ to help give us a condition number of $A_h$. More specifically, we use

\[
\frac{u^T A_h u}{u^T u} = \frac{u^T A_h u}{u^T M_h u}\, \frac{u^T M_h u}{u^T u} .
\]

Letting $u_h = \sum_{j=1}^{N} u_j \phi_j$ we can write $u^T A_h u = \int_\Omega \nabla u_h^T \nabla u_h\, dx$ while $u^T M_h u = \int_\Omega u_h^2\, dx$. Note that for the Poisson problem with boundary values specified, we have $u_h(x) = 0$ for $x \in \partial\Omega$. The ratio

\[
\frac{u^T M_h u}{u^T u}
\]

lies between $\lambda_{\min}(M_h)$ and $\lambda_{\max}(M_h) = \lambda_{\min}(M_h)\, \kappa_2(M_h)$, and so cannot vary greatly. The other ratio is

\[
\frac{u^T A_h u}{u^T M_h u} = \frac{\int_\Omega \nabla u_h^T \nabla u_h\, dx}{\int_\Omega u_h^2\, dx} . \tag{6.3.29}
\]


If we allow $u_h$ to range over all functions $u$ in $H^1(\Omega)$ with zero boundary conditions, there is still a lower bound to this ratio $\int_\Omega \nabla u^T \nabla u\, dx / \int_\Omega u^2\, dx$ that is positive. The minimum is, in fact, the smallest eigenvalue of the negative Laplacian operator:

\[
-\nabla^2 u = \lambda u \quad \text{in } \Omega, \qquad u = 0 \quad \text{on } \partial\Omega .
\]

However, there is no upper bound to the ratio $\int_\Omega \nabla u^T \nabla u\, dx / \int_\Omega u^2\, dx$. Instead, we need to use inverse estimates, which come from the elements.

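As before, we have the basis for $\hat K$ given by $\{\hat\phi_1, \hat\phi_2, \ldots, \hat\phi_{\hat N}\}$, and $\phi_j(x) = \hat\phi_r(\hat x)$ where $x = T_K(\hat x)$. Now

\[
\frac{\int_{\hat K} \nabla \hat u^T \nabla \hat u\, d\hat x}{\int_{\hat K} \hat u^2\, d\hat x} \le \hat B \qquad \text{for all } \hat u = \sum_{r=1}^{\hat N} c_r \hat\phi_r .
\]

The bound exists because $\hat u$ belongs to a finite dimensional space. For example, for linear functions over the standard unit triangle $\hat K$, the maximum ratio is 36 (the largest generalized eigenvalue of the reference stiffness matrix with respect to the reference mass matrix). Now for $\phi_j(x) = \hat\phi_r(\hat x)$ and $\phi_\ell(x) = \hat\phi_s(\hat x)$ where $x = T_K(\hat x)$, we have $\nabla\phi_j(x) = A_K^{-T} \nabla\hat\phi_r(\hat x)$ and $\nabla\phi_\ell(x) = A_K^{-T} \nabla\hat\phi_s(\hat x)$. Let $u_h(x) = \sum_{j=1}^{N} u_j \phi_j(x)$. Then for $x \in K$, $\nabla u_h(x) = A_K^{-T} \sum_{r=1}^{\hat N} u_{j(K,r)} \nabla\hat\phi_r(\hat x)$. Integrating over $K$ gives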


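\[
\begin{aligned}
\int_K \nabla u_h(x)^T \nabla u_h(x)\, dx
&= \sum_{r,s=1}^{\hat N} u_{j(K,r)} u_{j(K,s)} \int_K \nabla\phi_{j(K,r)}(x)^T \nabla\phi_{j(K,s)}(x)\, dx \\
&= \sum_{r,s=1}^{\hat N} u_{j(K,r)} u_{j(K,s)}\, |\det A_K| \int_{\hat K} \nabla\hat\phi_r(\hat x)^T A_K^{-1} A_K^{-T} \nabla\hat\phi_s(\hat x)\, d\hat x \\
&\le \|A_K^{-1}\|_2^2\, |\det A_K| \sum_{r,s=1}^{\hat N} u_{j(K,r)} u_{j(K,s)} \int_{\hat K} \nabla\hat\phi_r(\hat x)^T \nabla\hat\phi_s(\hat x)\, d\hat x \\
&= \|A_K^{-1}\|_2^2\, |\det A_K| \int_{\hat K} \Big( \sum_{r=1}^{\hat N} u_{j(K,r)} \nabla\hat\phi_r(\hat x) \Big)^T \Big( \sum_{s=1}^{\hat N} u_{j(K,s)} \nabla\hat\phi_s(\hat x) \Big)\, d\hat x .
\end{aligned}
\]

Now we can use the bound for $\hat K$:

\[
\begin{aligned}
\int_K \nabla u_h(x)^T \nabla u_h(x)\, dx
&\le \|A_K^{-1}\|_2^2\, |\det A_K| \int_{\hat K} \Big( \sum_{r=1}^{\hat N} u_{j(K,r)} \nabla\hat\phi_r(\hat x) \Big)^T \Big( \sum_{s=1}^{\hat N} u_{j(K,s)} \nabla\hat\phi_s(\hat x) \Big)\, d\hat x \\
&\le \hat B\, \|A_K^{-1}\|_2^2\, |\det A_K| \int_{\hat K} \Big( \sum_{r=1}^{\hat N} u_{j(K,r)} \hat\phi_r(\hat x) \Big)^2 d\hat x \\
&= \hat B\, \|A_K^{-1}\|_2^2 \int_K u_h(x)^2\, dx .
\end{aligned}
\]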
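Noting that $\kappa_2(A_K) = \|A_K\|_2\, \|A_K^{-1}\|_2$, we see that $\|A_K^{-1}\|_2 = \kappa_2(A_K)/\|A_K\|_2 = O(\kappa_{\max}/h_K)$. Summing over all triangles, we get

\[
\sum_{K \in \mathcal{T}_h} \int_K \nabla u_h(x)^T \nabla u_h(x)\, dx
\le \hat B \sum_{K \in \mathcal{T}_h} \|A_K^{-1}\|_2^2 \int_K u_h(x)^2\, dx
\le \hat B \max_{K \in \mathcal{T}_h} \|A_K^{-1}\|_2^2 \sum_{K \in \mathcal{T}_h} \int_K u_h(x)^2\, dx .
\]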

That is,

\[
\int_\Omega \nabla u_h(x)^T \nabla u_h(x)\, dx = O\big( (\min_K h_K)^{-2} \big) \int_\Omega u_h(x)^2\, dx ,
\]

so

\[
\frac{u^T A_h u}{u^T M_h u} = O(h_{\min}^{-2}) \qquad \text{where } h_{\min} = \min_K h_K . \tag{6.3.30}
\]

Combined with the lower bound on $u^T A_h u / u^T M_h u$, this shows that $h_{\min}^2\, \kappa_2(A_h)$ is bounded provided we have a bound on $h_{\max}/h_{\min}$. As $h_{\min}$ decreases, the condition number grows, but only quadratically.
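
The constant $\hat B$ in the inverse estimate is a finite-dimensional quantity and can be computed directly. The sketch below assumes linear nodal basis functions $1 - \hat x - \hat y$, $\hat x$, $\hat y$ on the reference triangle with vertices $(0,0)$, $(1,0)$, $(0,1)$, and computes $\hat B$ as the largest generalized eigenvalue of the reference stiffness matrix with respect to the reference mass matrix. The variable names are illustrative.

```python
import numpy as np

# Reference triangle with vertices (0,0), (1,0), (0,1); its area is 1/2.
area = 0.5

# Constant gradients of the linear nodal basis functions 1-x-y, x, y.
grads = np.array([[-1.0, -1.0], [1.0, 0.0], [0.0, 1.0]])

# Reference stiffness matrix: integral of grad(phi_r)^T grad(phi_s).
K_ref = area * (grads @ grads.T)

# Reference mass matrix for linear elements: (area/12) * [[2,1,1],[1,2,1],[1,1,2]].
M_ref = (area / 12.0) * (np.ones((3, 3)) + np.eye(3))

# Reduce K c = lambda M c to a symmetric problem using M = L L^T.
L = np.linalg.cholesky(M_ref)
L_inv = np.linalg.inv(L)
ratios = np.linalg.eigvalsh(L_inv @ K_ref @ L_inv.T)

B_hat = ratios.max()
print("generalized eigenvalues:", ratios)
print("B_hat =", B_hat)
```

The zero eigenvalue corresponds to constant functions, whose gradient vanishes; the largest eigenvalue is the bound $\hat B$ for this element.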

If we have a situation where $h_{\max}/h_{\min}$ is large, but neighboring triangles have similar sizes ($h_K/h_L \le 2$ for any neighboring triangles $K$ and $L$, for example), then it is still possible to have a diagonal preconditioner with $\kappa_2(D A_h D) = O(h_{\max}^{-2})$.


6.3.3 Handling Boundary Conditions 


So far we have considered essential (or Dirichlet) boundary conditions, where $u(x) = g(x)$ for all $x \in \partial\Omega$ with $g$ a given function. There are many other kinds of linear boundary conditions, most particularly natural (or Neumann) boundary conditions, which in this case have the form $\partial u/\partial n(x) = h(x)$ on $\partial\Omega$ where $\partial/\partial n$ is the outward normal derivative, and mixed (or Robin) boundary conditions, which combine the previous two types. Consider the elliptic partial differential equation

\[
-\operatorname{div}(a(x) \nabla u) + b(x)\, u = f(x) \quad \text{in } \Omega . \tag{6.3.31}
\]


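We can create a weak form by multiplying by a smooth function $v(x)$ and integrating over $\Omega$. Then

\[
\begin{aligned}
0 &= \int_\Omega v \left[ -\operatorname{div}(a(x) \nabla u) + b(x) u - f \right] dx \\
&= \int_\Omega \left\{ -\operatorname{div}(v\, a(x) \nabla u) + a(x) \nabla v^T \nabla u + v\, [b(x) u - f] \right\} dx \\
&= -\int_{\partial\Omega} v\, a(x)\, \nabla u^T n(x)\, dS(x) + \int_\Omega \left[ a(x) \nabla v^T \nabla u + b(x)\, v u - f v \right] dx .
\end{aligned}
\]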


If we have essential boundary conditions $u(x) = g(x)$ for $x \in \Gamma_D$ with $\Gamma_D$ a subset of $\partial\Omega$, then we need to impose the condition that $v(x) = 0$ for $x \in \Gamma_D$. On the other hand, if we have natural boundary conditions $\partial u/\partial n(x) = n(x)^T \nabla u(x) = h(x)$ for $x \in \Gamma_N$, then we have to set

\[
0 = -\int_{\partial\Omega} v(x)\, a(x)\, \nabla u(x)^T n(x)\, dS(x) + \int_\Omega \left[ a(x) \nabla v(x)^T \nabla u(x) + b(x)\, v(x) u(x) - f(x) v(x) \right] dx .
\]


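Provided the part of the boundary on which the natural conditions hold, $\Gamma_N$, is complementary to the part of the boundary where the essential (Dirichlet) conditions hold, $\Gamma_N = \partial\Omega \setminus \Gamma_D$, we can write

\[
\int_{\partial\Omega} v(x)\, a(x)\, \nabla u(x)^T n(x)\, dS(x) = \int_{\Gamma_N} v(x)\, a(x)\, \frac{\partial u}{\partial n}(x)\, dS(x) = \int_{\Gamma_N} v(x)\, a(x)\, h(x)\, dS(x) .
\]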


So the weak form for these boundary conditions is

\[
\int_\Omega \left[ a(x) \nabla v(x)^T \nabla u(x) + b(x)\, v(x) u(x) \right] dx = \int_\Omega f(x) v(x)\, dx + \int_{\Gamma_N} v(x)\, a(x)\, h(x)\, dS(x) \tag{6.3.32}
\]

for all smooth $v$ where $v(x) = 0$ on $\Gamma_D$.


The Galerkin method for this problem is then: find $u_h \in V_h$ where $u_h(x) = g(x)$ for $x \in \Gamma_D$, and

\[
\int_\Omega \left[ a(x) \nabla v_h(x)^T \nabla u_h(x) + b(x)\, v_h(x) u_h(x) \right] dx = \int_\Omega f(x) v_h(x)\, dx + \int_{\Gamma_N} v_h(x)\, a(x)\, h(x)\, dS(x) \tag{6.3.33}
\]

for all $v_h \in V_h$ where $v_h(x) = 0$ on $\Gamma_D$.


This gives a symmetric linear system of equations for the coefficients $u_j$ in $u_h = \sum_{j=1}^{N} u_j \phi_j$. The matrix entries are

\[
a_{jk} = \int_\Omega \left[ a(x) \nabla\phi_j(x)^T \nabla\phi_k(x) + b(x)\, \phi_j(x) \phi_k(x) \right] dx .
\]

Provided $a(x) > 0$ and $b(x) \ge 0$ for all $x \in \Omega$, and $\Gamma_D$ has positive length (for $d = 2$) or area (for $d = 3$), then the matrix $A_h = [a_{jk} \mid j, k = 1, 2, \ldots, N]$ is also positive definite. The right-hand side is the vector with components

\[
f_j = \int_\Omega f(x) \phi_j(x)\, dx + \int_{\Gamma_N} \phi_j(x)\, a(x)\, h(x)\, dS(x) .
\]
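
To see how the essential and natural conditions enter the linear system differently, the following sketch solves a one-dimensional model problem chosen here for illustration (it is not from the text): $-u'' + u = 0$ on $(0,1)$ with $u(0) = 1$ (essential) and $u'(1) = e$ (natural), whose exact solution is $u(x) = e^x$. Piecewise linear elements on a uniform mesh are assumed, with $a = b = 1$; the function name `solve_mixed_bvp` is hypothetical.

```python
import numpy as np

def solve_mixed_bvp(n_elems, g=1.0, h_neumann=np.e):
    """P1 Galerkin solution of -u'' + u = 0, u(0) = g, u'(1) = h_neumann."""
    x = np.linspace(0.0, 1.0, n_elems + 1)
    n = n_elems + 1
    A = np.zeros((n, n))
    rhs = np.zeros(n)
    for k in range(n_elems):
        hk = x[k + 1] - x[k]
        stiff = (1.0 / hk) * np.array([[1.0, -1.0], [-1.0, 1.0]])  # a * phi_j' phi_k'
        mass = (hk / 6.0) * np.array([[2.0, 1.0], [1.0, 2.0]])     # b * phi_j phi_k
        A[k:k + 2, k:k + 2] += stiff + mass
    # Natural condition: boundary term phi_j(1) a(1) h goes into the right-hand side.
    rhs[-1] += h_neumann
    # Essential condition: fix u_0 = g and move its contribution to the right-hand side.
    u = np.zeros(n)
    u[0] = g
    free = np.arange(1, n)
    u[free] = np.linalg.solve(A[np.ix_(free, free)], rhs[free] - A[free, 0] * g)
    return x, u

x, u = solve_mixed_bvp(50)
err = np.max(np.abs(u - np.exp(x)))
print(f"maximum nodal error with 50 elements: {err:.2e}")
```

Note that the Neumann node remains an unknown of the system, while the Dirichlet node is eliminated; only the load vector changes for the natural condition.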


Robin boundary conditions are a mix of Dirichlet and natural boundary conditions and have the form

\[
\frac{\partial u}{\partial n}(x) + c(x)\, u(x) = h(x) \qquad \text{for } x \in \Gamma_R \subseteq \partial\Omega . \tag{6.3.34}
\]

The weak form of (6.3.31) with $u = g$ on $\Gamma_D$ and $\partial u/\partial n + c u = h$ on $\Gamma_R$, where $\Gamma_D$ and $\Gamma_R$ partition the boundary $\partial\Omega$, is

\[
\int_\Omega \left[ a(x) \nabla v^T \nabla u + b(x)\, u v \right] dx + \int_{\Gamma_R} a(x) c(x)\, u v\, dS(x) = \int_\Omega f v\, dx + \int_{\Gamma_R} a(x) h(x)\, v\, dS(x)
\]

for all smooth $v$ with $v = 0$ on $\Gamma_D$. Usually we require that $c(x) \ge 0$ for all $x \in \Gamma_R$. The stiffness matrix is then given by

\[
a_{jk} = \int_\Omega \left[ a(x) \nabla\phi_j^T \nabla\phi_k + b(x)\, \phi_j \phi_k \right] dx + \int_{\Gamma_R} a(x) c(x)\, \phi_j \phi_k\, dS(x) ,
\]

and is again positive definite under the inequalities assumed above on $a$, $b$, and $c$ provided $\operatorname{area}(\Gamma_D) > 0$ or $b(x) > 0$ for all $x \in \Omega$.


6.3.3.1 Numerical Integration


In general, these entries will need to be computed numerically using a suitable numerical integration method, such as those described in Section 5.3.4. This will perturb the matrix entries and the right-hand side of the linear system to be solved. Using a numerical approximation of these integrals

\[
a_{jk} \approx \sum_{\ell=1}^{M} w_\ell \left[ a(z_\ell)\, \nabla\phi_j(z_\ell)^T \nabla\phi_k(z_\ell) + b(z_\ell)\, \phi_j(z_\ell) \phi_k(z_\ell) \right]
\]

we still obtain symmetric, and provided there are sufficiently many integration points $z_\ell$ in each triangle, positive definite linear systems of equations.

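Given an integration method on a reference triangle $\hat K$ and using an affine transformation $T_K \colon \hat K \to K$ we have a corresponding integration method on $K$:

\[
\int_{\hat K} \hat\psi(\hat x)\, d\hat x \approx \sum_{\ell=1}^{M} \hat w_\ell\, \hat\psi(\hat x_\ell) .
\]

If $\psi(x) = \hat\psi(\hat x)$ where $x = T_K(\hat x) = A_K \hat x + b_K$, we have the approximation

\[
\int_K \psi(x)\, dx = |\det A_K| \int_{\hat K} \hat\psi(\hat x)\, d\hat x \approx |\det A_K| \sum_{\ell=1}^{M} \hat w_\ell\, \hat\psi(\hat x_\ell) = |\det A_K| \sum_{\ell=1}^{M} \hat w_\ell\, \psi(T_K(\hat x_\ell)) .
\]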
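
As an illustration of transferring a rule from the reference triangle, the sketch below uses the three-point edge-midpoint rule on the unit triangle (weights $1/6$ each, exact for polynomials of degree $\le 2$); this particular rule, and the names `REF_POINTS`, `REF_WEIGHTS`, and `integrate_on_triangle`, are choices made here for the example rather than anything prescribed by the text.

```python
import numpy as np

# Edge-midpoint rule on the reference triangle (0,0), (1,0), (0,1).
REF_POINTS = np.array([[0.5, 0.0], [0.5, 0.5], [0.0, 0.5]])
REF_WEIGHTS = np.array([1.0 / 6.0, 1.0 / 6.0, 1.0 / 6.0])

def integrate_on_triangle(psi, vertices):
    """Approximate the integral of psi(x, y) over the triangle with the given vertices."""
    v0, v1, v2 = (np.asarray(v, dtype=float) for v in vertices)
    A_K = np.column_stack([v1 - v0, v2 - v0])   # T_K(xhat) = A_K xhat + b_K
    b_K = v0
    points = REF_POINTS @ A_K.T + b_K           # mapped integration points T_K(xhat_l)
    values = np.array([psi(p[0], p[1]) for p in points])
    return abs(np.linalg.det(A_K)) * np.dot(REF_WEIGHTS, values)

# Quadratic integrand on the triangle (0,0), (2,0), (0,1); the exact integral is 5/6.
value = integrate_on_triangle(lambda x, y: x**2 + x * y, [(0, 0), (2, 0), (0, 1)])
print(value)
```

Because the integrand is quadratic and the rule is exact for degree two, the mapped rule reproduces the exact value up to rounding.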


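For $\phi_j(x) = \hat\phi_r(\hat x)$ when $x \in K$, note that $\nabla\phi_j(x) = A_K^{-T} \nabla\hat\phi_r(\hat x)$, and so

\[
\nabla\phi_j(x)^T \nabla\phi_k(x) = \nabla\hat\phi_r(\hat x)^T A_K^{-1} A_K^{-T} \nabla\hat\phi_s(\hat x) ,
\]

where $\phi_k(x) = \hat\phi_s(\hat x)$ for $x \in K$.

If we can write $\Omega = \bigcup_{K \in \mathcal{T}_h} K$ where $\mathcal{T}_h$ is a triangulation of $\Omega$, then the integral

\[
\begin{aligned}
\int_\Omega a(x)\, \nabla\phi_j(x)^T \nabla\phi_k(x)\, dx
&= \sum_{K \in \mathcal{T}_h} \int_K a(x)\, \nabla\phi_j(x)^T \nabla\phi_k(x)\, dx \\
&= \sum_{K \in \mathcal{T}_h} |\det A_K| \int_{\hat K} a(T_K(\hat x))\, \nabla\hat\phi_r(\hat x)^T A_K^{-1} A_K^{-T} \nabla\hat\phi_s(\hat x)\, d\hat x \\
&\qquad \text{where } \phi_j(x) = \hat\phi_r(\hat x),\ \phi_k(x) = \hat\phi_s(\hat x) \text{ for } x \in K \\
&\approx \sum_{K \in \mathcal{T}_h} |\det A_K| \sum_{\ell=1}^{M} \hat w_\ell\, a(T_K(\hat x_\ell))\, \nabla\hat\phi_r(\hat x_\ell)^T A_K^{-1} A_K^{-T} \nabla\hat\phi_s(\hat x_\ell) .
\end{aligned}
\]

Similarly,

\[
\int_\Omega b(x)\, \phi_j(x) \phi_k(x)\, dx \approx \sum_{K \in \mathcal{T}_h} |\det A_K| \sum_{\ell=1}^{M} \hat w_\ell\, b(T_K(\hat x_\ell))\, \hat\phi_r(\hat x_\ell)\, \hat\phi_s(\hat x_\ell) .
\]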


If the basis functions $\hat\phi_r$ on $\hat K$ are polynomials of degree $\le m$, we need the integration methods to be exact for polynomials of degree $\le 2(m - 1)$ in order to get convergence, and preferably exact for polynomials of degree $\le 2m$. To see why, consider a triangle $K$ of diameter $h_K$ that is much smaller than the diameter of $\Omega$. Even though we might assume $a(x)$ and $b(x)$ to change slowly over $K$, the same cannot be said of the basis functions $\phi_j(x)$. Thus we need to integrate $\int_K \nabla\phi_j^T \nabla\phi_k\, dx$ (and $\int_K \phi_j \phi_k\, dx$ if $b \not\equiv 0$) exactly. For example, if we are dealing with cubic Hermite elements we should use an integration method that is exact for degree 4 (and if $b \not\equiv 0$, degree 6) polynomials.


6.3.4 Convection—Going with the Flow 


If $u(t, x)$ represents the concentration of an unreacting chemical species pulled along by a current in water with velocity $v(x)$, for example, the equation for $u(t, x)$ is

\[
\frac{\partial u}{\partial t} + (v(x) \cdot \nabla) u = \operatorname{div}(D \nabla u) + f(x) \quad \text{in } \Omega \tag{6.3.35}
\]

with various boundary conditions. The boundary conditions can describe prescribed concentrations (perhaps at the inflow to a region: $u = g$ on $\Gamma_D$), zero flux conditions (that apply at a wall, for example, where $v \cdot n\, u + D\, \partial u/\partial n = 0$ on $\Gamma_Z$), and outflow conditions ($\partial u/\partial n = 0$ on $\Gamma_O$). The velocity field $v(x)$ represents the velocity of the current at $x$.

If we look for steady-state solutions, we set $\partial u/\partial t = 0$ and so

\[
(v(x) \cdot \nabla) u - \operatorname{div}(D \nabla u) = f(x) \quad \text{in } \Omega .
\]


If we use $w(x)$ as a smooth function for creating the weak form, then the weak form is

\[
\int_\Omega \left[ w\, v \cdot \nabla u + D \nabla w^T \nabla u \right] dx - \int_{\partial\Omega} w\, D\, \frac{\partial u}{\partial n}\, dS = \int_\Omega w\, f(x)\, dx .
\]


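The main difference with the equation without convection is the term $\int_\Omega w\, v \cdot \nabla u\, dx$. Note that

\[
\operatorname{div}(w\, u\, v) = (u\, v) \cdot \nabla w + (w\, v) \cdot \nabla u + w\, u \operatorname{div} v .
\]

If $\operatorname{div} v = 0$, which is the case for an incompressible flow field, then

\[
\int_\Omega w\, v \cdot \nabla u\, dx = \int_\Omega \left[ \operatorname{div}(w\, u\, v) - u\, v \cdot \nabla w \right] dx = \int_{\partial\Omega} w\, u\, v \cdot n\, dS - \int_\Omega u\, v \cdot \nabla w\, dx .
\]

If $w = 0$ on $\partial\Omega$ then we get

\[
\int_\Omega w\, v \cdot \nabla u\, dx = -\int_\Omega u\, v \cdot \nabla w\, dx .
\]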


In terms of the matrices, if $b_{jk} = \int_\Omega \phi_j\, v \cdot \nabla\phi_k\, dx$ and either $\phi_j$ or $\phi_k$ is zero on $\partial\Omega$, then $b_{kj} = -b_{jk}$. That is, apart from terms related to the boundary, the matrix $B_h = [b_{jk} \mid j, k = 1, 2, \ldots, N]$ is anti-symmetric. If $a_{jk} = \int_\Omega D \nabla\phi_j^T \nabla\phi_k\, dx$, then $A_h$ is positive definite provided $\Gamma_D$ has length (or area) that is positive. In that case and ignoring the boundary terms, $A_h + B_h$ would also be positive definite in the sense that $u^T (A_h + B_h) u = u^T A_h u > 0$ for $u \ne 0$. The resulting system of equations is then invertible. The condition number of $A_h + B_h$ would also not be much larger than that of $A_h$.
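
The anti-symmetry of the convection matrix can be checked on a small example. The sketch below assumes a one-dimensional analogue with constant velocity $v$ and constant diffusion coefficient $D$, piecewise linear elements on a uniform mesh of $(0,1)$, and homogeneous Dirichlet conditions at both ends so that only interior basis functions are kept; the function name `convection_diffusion_1d` is hypothetical.

```python
import numpy as np

def convection_diffusion_1d(n_elems, D, v):
    """Interior blocks of A (diffusion) and B (convection) for P1 elements on (0, 1)."""
    h = 1.0 / n_elems
    n = n_elems + 1
    A = np.zeros((n, n))
    B = np.zeros((n, n))
    # Element matrices: a_jk = int D phi_j' phi_k' dx,  b_jk = int phi_j v phi_k' dx.
    A_elem = (D / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])
    B_elem = (v / 2.0) * np.array([[-1.0, 1.0], [-1.0, 1.0]])
    for k in range(n_elems):
        A[k:k + 2, k:k + 2] += A_elem
        B[k:k + 2, k:k + 2] += B_elem
    interior = slice(1, n - 1)      # drop basis functions that are nonzero on the boundary
    return A[interior, interior], B[interior, interior]

A, B = convection_diffusion_1d(20, D=0.01, v=1.0)
skew_defect = np.max(np.abs(B + B.T))           # zero if B is anti-symmetric
sym_part = 0.5 * ((A + B) + (A + B).T)          # equals A when B is anti-symmetric
min_eig = np.linalg.eigvalsh(sym_part).min()
print(f"max |B + B^T| = {skew_defect:.1e}, smallest eigenvalue of sym. part = {min_eig:.3e}")
```

The symmetric part of $A_h + B_h$ is just $A_h$ here, so positivity of $u^T (A_h + B_h) u$ is inherited from the diffusion term alone.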

There can be problems where $D > 0$ but is small, and $v(x)$ is large. This is the convection dominated regime. Although, apart from boundary terms, $B_h$ is anti-symmetric, if $B_h$ has large entries this can cause problems with solving $(A_h + B_h) u_h = f_h$ for $u_h$.


6.3.5 Higher Order Problems 


Fourth order partial differential equations arise in a number of settings, such as elastic plate problems. A typical example is the biharmonic equation, which can be written as

\[
\Delta\Delta u = f(x) \quad \text{in } \Omega , \tag{6.3.36}
\]

where $\Delta$ is the Laplacian operator ($\Delta u = \partial^2 u/\partial x^2 + \partial^2 u/\partial y^2$ in two dimensions), together with appropriate boundary conditions, such as the Dirichlet conditions $u(x) = g(x)$, $\partial u/\partial n(x) = k(x)$ for $x \in \partial\Omega$. The weak form of the equation with Dirichlet boundary conditions is that

\[
0 = \int_\Omega \left[ (\Delta u)(\Delta v) - f v \right] dx \quad \text{for all smooth } v , \tag{6.3.37}
\]

where $v = \partial v/\partial n = 0$ on $\partial\Omega$. Standard conforming finite element methods have to use basis functions $\phi_j$ where $\int_\Omega (\Delta\phi_j)^2\, dx$ is finite. This means that if $\phi_j$ is piecewise smooth, then there cannot be any jumps in $\nabla\phi_j$. The basis functions should therefore be $C^1$ (continuous first derivatives), which are harder to create. Section 4.3.2.1 shows some examples: the Argyris element (Figure 4.3.7), and the HCT macro element (Figure 4.3.8). The order of convergence of these methods is essentially given by the order of the polynomials that can be represented by the elements used. These $C^1$ finite elements are complicated to construct, so there has been a great deal of interest in other methods of solving equations like the biharmonic equation. The equation $\Delta\Delta u = f$ in $\Omega$ with Dirichlet boundary conditions is an elliptic partial differential equation on $H^2(\Omega)$. Most of the theory of this section can be extended to problems of this type, although the condition number of the system of equations is $\kappa_2(A_h) = O(h^{-4})$ rather than $O(h^{-2})$ as for the second order elliptic equations.

[Fig. 6.3.4: Triangle decomposition]


Exercises. 


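(1) Implement the finite difference method for the problem $\nabla \cdot (a(x) \nabla u) = f(x)$ for $x \in \Omega \subset \mathbb{R}^2$ and $u(x) = g(x)$ for $x \in \partial\Omega$ using the approximation

\[
\frac{\partial}{\partial x}\Big( a \frac{\partial u}{\partial x} \Big)(x, y) \approx \frac{1}{h^2} \left[ a(x + \tfrac12 h, y)\, \big( u(x + h, y) - u(x, y) \big) + a(x - \tfrac12 h, y)\, \big( u(x - h, y) - u(x, y) \big) \right] .
\]

Show empirically that the asymptotic error of this formula is $O(h^m)$ as $h \to 0$ for some $m$ assuming that both $a$ and $u$ are smooth. What is the value of $m$ in general?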

(2) Consider solving the partial differential equations $\operatorname{div} q = f(x)$ and $q = \nabla u$ in $\Omega \subset \mathbb{R}^2$. From the divergence theorem, note that for a region $R$, $\int_R \operatorname{div} q\, dx = \int_{\partial R} q \cdot n\, dS$. In particular, for a point $(x_j, y_k) \in \Omega$ where $x_j = x_0 + j h$ and $y_k = y_0 + k h$, let $R$ be the square $(x_j - \tfrac12 h, x_j + \tfrac12 h) \times (y_k - \tfrac12 h, y_k + \tfrac12 h)$. We set the values of

\[
q_2(x_j, y_k \pm \tfrac12 h)\, \operatorname{sign}(\pm\tfrac12 h) = h^{-1} \left[ u(x_j, y_k \pm h) - u(x_j, y_k) \right] ,
\]

\[
q_1(x_j \pm \tfrac12 h, y_k)\, \operatorname{sign}(\pm\tfrac12 h) = h^{-1} \left[ u(x_j \pm h, y_k) - u(x_j, y_k) \right] .
\]

Develop discrete equations approximating the differential equations in terms of the values of $u$ and $q_1$, $q_2$ at the above points using $\int_{\partial R} q \cdot n\, dS = \int_R \operatorname{div} q\, dx \approx f(x_j, y_k)\, h^2$. This approach, using fluxes like $q$ and approximations of integrals, is known as the finite volume method.

(3) For $\Omega = \{ (x, y) \mid x^2 + y^2 < 1 \}$ and a regular grid $(x_j, y_k)$ with $x_j = j h$ and $y_k = k h$ for grid spacing $h > 0$, let $A_h$ be the matrix for the differential equation $-\nabla^2 u = f(x)$ in $\Omega$ with boundary conditions $u(x) = 0$ on $\partial\Omega$ discretized using the finite difference method on these grid points. Plot the condition number $\kappa_2(A_h)$ against $h$ for $h = 2^{-k}$, $k = 1, 2, \ldots, 7$. Note that if $A_h$ is $n_h \times n_h$, then $n_h$ is the number of grid points in $\Omega$, which is $\sim \pi/h^2$ as $h \to 0$. Because $n_h$ becomes large rapidly as $h$ decreases, it is important that if $A_h$ is created explicitly, then it should be represented as a sparse matrix.

(4) Suppose we apply the finite element method to $\nabla \cdot (a \nabla u) = f(x)$ in $\Omega$ with a triangulation $\mathcal{T}_h$ with each element $T \in \mathcal{T}_h$ having diameter $\le C_1 h$. Suppose also that $a(x)$ is smooth. Suppose that we use Lagrange basis functions of degree $d$ (as illustrated in Figure 4.3.1). Show that using an integration method that is exact for polynomials of degree $\le m$ will give errors in the estimate of $\int_K a(x)\, \nabla\phi_j(x)^T \nabla\phi_k(x)\, dx$ of size $O(h^{m-2d+3}) \int_K \|\nabla\phi_j(x)\|\, \|\nabla\phi_k(x)\|\, dx$ as $h \to 0$ for basis functions $\phi_j$, $\phi_k$, provided $m \ge 2d - 2$. [Hint: Let $\tilde a(x)$ be an interpolant of $a$ on $K$ of degree $m - 2d + 2$; use the estimate $\max_{x \in K} |a(x) - \tilde a(x)| = O(h^{m-2d+3})$.]

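(5) Consider the PDE boundary value problem

\[
\operatorname{div}(a(x) \nabla u) = f(x) \quad \text{for } x \in \Omega , \qquad \beta(x)\, u(x) + \frac{\partial u}{\partial n}(x) = g(x) \quad \text{for } x \in \partial\Omega .
\]

Write this PDE in weak form. Show that the bilinear form in the weak form is coercive with respect to $H^1(\Omega)$ provided $\inf_{x \in \Omega} a(x) > 0$ and $\inf_{x \in \partial\Omega} \beta(x) > 0$. Assume that $\int_{\partial\Omega} u^2\, dS(x) \big/ \int_\Omega [\, \|\nabla u\|^2 + u^2\, ]\, dx$ has a positive lower bound. These kinds of boundary conditions are called Robin conditions after Victor Robin (French mathematician, 1855-1897).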

(6) Use some implementation of the finite element method (perhaps this chapter's Project) to solve the equation $-\nabla^2 u = f(x)$ in $\Omega := \{ (x, y) \mid x^2 + y^2 < 1 \ \&\ \max(x, y) > 0 \}$ with $u(x, y) = g(x, y)$ for $(x, y) \in \partial\Omega$ using the finite element method with piecewise linear basis functions on a triangulation you generate. The function $g(x, y)$ is not piecewise linear, but we can create a piecewise linear approximation by minimizing $\int_{\partial\Omega} (u_h(x) - g(x))^2\, dS(x)$, which can be done by computing the mass matrix on the boundary: solve $\tilde M \tilde u = \tilde b$ where $\tilde m_{k\ell} = \int_{\partial\Omega} \phi_k(x) \phi_\ell(x)\, dS(x)$, $\tilde b_k = \int_{\partial\Omega} g(x) \phi_k(x)\, dS(x)$, and we set $u_h(p_k) = \tilde u_k$ where $p_k \in \partial\Omega$. We use $g(x, y) = x^2 - e^y x y + \cos(x + y)$ and $f(x, y) = 2\cos(y + x) + x (y + 2) e^y - 2$. Note that the exact solution is, in fact, $u(x, y) = g(x, y)$. Use this fact to compute the error in the finite element solution with an element size of approximately 0.1.

(7) Any triangle can be subdivided into four congruent subtriangles (congruent, meaning same size and shape, but possibly different positions and orientations): join the midpoints of each edge. This is illustrated in Figure 6.3.4. Write code that takes a given triangulation represented by p (position) and t (triangle) arrays as described in Exercise 2, and produces a new triangulation where each triangle in the original triangulation is replaced by its subdivision into four subtriangles.

(8) Use the subdivision method of the previous exercise and the method used in Exercise 2 to estimate the convergence rate of the finite element method. Do this by plotting $\max_i |(u_h)_i - u(p_i)|$ against the number of subdivisions made to the original triangulation. Do this for $j$ levels of triangle decomposition for $j = 0, 1, 2, 3, 4$. Plot the results with a logarithmic scale for the maximum error, but linear in $j$. Use this to obtain an estimate for the maximum error of the form $C h^\alpha$ where $h$ is the maximum diameter of the triangles in the triangulation. Note that each level of triangle decomposition halves the value of $h$.

(9) Let $\hat\phi_j$, $j = 1, 2, 3$, be the standard linear nodal basis functions on the standard unit triangle $\hat K$ with vertices $\hat v_1 = (0, 0)$, $\hat v_2 = (1, 0)$, and $\hat v_3 = (0, 1)$, and $\hat\psi_k$, $k = 1, 2, \ldots, 6$, be quadratic nodal basis functions on $\hat K$ (see Figure 4.3.1(a) for interpolation nodes). Find the matrix $B$ ($3 \times 6$) so that $\hat\phi_j = \sum_{k=1}^{6} b_{jk} \hat\psi_k$.

Find a pseudo-inverse $C$ ($6 \times 3$) so that $B C = I$ ($3 \times 3$). The matrix $C$ should have proper symmetries so that if $\hat T \colon \hat K \to \hat K$ is an affine symmetry of $\hat K$ where $\hat\phi_\ell = \hat\phi_j \circ \hat T$ and $\hat\psi_m = \hat\psi_k \circ \hat T$, then $c_{m\ell} = c_{kj}$. [Note: The pseudo-inverse can be the Moore-Penrose pseudo-inverse (2.5.10).]


6.4 Partial Differential Equations—Diffusion and Waves 


Introducing time into partial differential equations gives a new range of phenomena as well as numerical methods. Most methods for these problems are time-stepping methods that combine methods for initial value problems discussed in Section 6.1 with the methods for partial differential equations in Section 6.3. Standard examples of these kinds of equations include the diffusion or heat equation

\[
\frac{\partial u}{\partial t} = \nabla \cdot (D(x) \nabla u) + f(t, x) \quad \text{in } \Omega , \tag{6.4.1}
\]

and the wave equation

\[
\frac{\partial^2 u}{\partial t^2} = \nabla \cdot (C \nabla u) + f(t, x) \quad \text{in } \Omega . \tag{6.4.2}
\]

The boundary conditions can be Dirichlet type conditions ($u(t, x) = g(t, x)$ for $x \in \partial\Omega$), Neumann type conditions ($\partial u/\partial n(t, x) = g(t, x)$ for $x \in \partial\Omega$), or mixed Robin type conditions ($\alpha\, u(t, x) + \beta\, \partial u/\partial n(t, x) = g(t, x)$ for $x \in \partial\Omega$). It should be noted that the behavior of solutions to diffusion equations and wave equations is quite different. Solutions of diffusion equations rapidly smooth out as time increases, while solutions of wave equations generally maintain their roughness over time. Nonlinear wave equations are particularly challenging.

Typically, finite element or finite difference methods are used for the spatial vari- 
ables, while standard ODE methods are used in the time variable. More recently, 
there have been considerable efforts to develop so-called space-time and moving 
mesh methods that can have advantages in dealing with, for example, simulations of 
isolated waves. Space-time discretizations are outside the scope of this book. 


6.4.1 Method of Lines 


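The method of lines first uses a discretization of the spatial variables, leaving the time derivatives as time derivatives. For example, consider the standard diffusion equation

\[
\frac{\partial u}{\partial t} = \nabla^2 u + f(t, x) \quad \text{in } \Omega , \qquad u(t, x) = g(x) \quad \text{on } \partial\Omega .
\]

We can apply the finite element method in $x$: triangulate $\Omega$, and create a basis of functions $\phi_i$, $i = 1, 2, \ldots, N$, based on a particular element for this triangulation. We write $u(t, x) = \sum_{i=1}^{N} u_i(t)\, \phi_i(x)$ in terms of the basis functions $\phi_i$ with time-varying coefficients $u_i(t)$. Applying the weak form with $v(x) = \phi_j(x)$ with $v = 0$ on the boundary $\partial\Omega$,

\[
\sum_{i=1}^{N} \frac{du_i}{dt}(t) \int_\Omega \phi_i(x) \phi_j(x)\, dx = \int_\Omega \Big[ -\sum_{i=1}^{N} u_i(t)\, \nabla\phi_i(x)^T \nabla\phi_j(x) + f(t, x)\, \phi_j(x) \Big] dx + \int_{\partial\Omega} \sum_{i=1}^{N} u_i(t)\, \frac{\partial\phi_i}{\partial n}(x)\, \phi_j(x)\, dS(x) .
\]

The final integral is zero as $\phi_j = 0$ on $\partial\Omega$. We set $\sum_{i=1}^{N} u_i(t)\, \phi_i(x) \approx g(x)$ for $x \in \partial\Omega$. If we are using a nodal basis, we can set $u_i(t) = g(x_i)$ where $x_i \in \partial\Omega$ is the "node" for basis function $\phi_i$.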

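Let $u_h(t)$ be the vector of coefficients $u_i(t)$ where $\phi_i(x)$ is zero for all $x \in \partial\Omega$. The values of $u_i(t)$ are assumed to be fixed by the boundary conditions if $\phi_i(x) \ne 0$ for some $x \in \partial\Omega$. If we set $\mathcal{N} = \{ i \mid \phi_i(x) = 0 \text{ for all } x \in \partial\Omega \}$, then we obtain the ordinary differential equations

\[
\sum_{i \in \mathcal{N}} \Big( \int_\Omega \phi_i \phi_j\, dx \Big) \frac{du_i}{dt}(t) = -\sum_{i \in \mathcal{N}} \Big( \int_\Omega \nabla\phi_i^T \nabla\phi_j\, dx \Big) u_i(t) - \sum_{i \notin \mathcal{N}} \Big( \int_\Omega \nabla\phi_i^T \nabla\phi_j\, dx \Big) u_i + \int_\Omega f(t, x)\, \phi_j(x)\, dx \qquad \text{for } j \in \mathcal{N} . \tag{6.4.3}
\]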


The matrix with entries $m_{ij} = \int_\Omega \phi_i \phi_j\, dx$, $i, j \in \mathcal{N}$, is the mass matrix $M_h$ (6.3.26), while the matrix $[a_{ij} = \int_\Omega \nabla\phi_i^T \nabla\phi_j\, dx \mid i, j \in \mathcal{N}]$ is the stiffness matrix $A_h$ (6.3.27). Both are symmetric positive semi-definite matrices; the mass matrix is positive definite provided the basis functions $\phi_i$, $i \in \mathcal{N}$, are linearly independent over $\Omega$.

The resulting ordinary differential equation for $u_h(t)$ is

\[
M_h \frac{du_h}{dt} = -A_h u_h + f_h(t) . \tag{6.4.4}
\]


For the wave equation, the resulting differential equation is

\[
M_h \frac{d^2 u_h}{dt^2} = -A_h u_h + f_h(t) . \tag{6.4.5}
\]

A variant of this approach is the lumped mass approximation, which approximates $M_h$ with a diagonal matrix: $\tilde M = \operatorname{diag}(\tilde m_i \mid i \in \mathcal{N})$ with $\tilde m_i \approx \sum_j m_{ij}$, for example.

The eigenvalues of the stiffness matrix $A_h$ approach the eigenvalues of the negative Laplacian $-\nabla^2$; the eigenvalues $\lambda_j$ of $-\nabla^2$ are $> 0$ and $\lambda_j \to +\infty$ as $j \to \infty$. The large eigenvalues of $-\nabla^2$ have highly oscillatory eigenfunctions and correspond to high frequency components of the solution. The eigenfunctions associated with large eigenvalues of $-\nabla^2$ are hard to approximate, and so the large eigenvalues $\tilde\lambda_j$ of $A_h$ will not necessarily be good approximations of $\lambda_j$. However, even in this case, the eigenvectors of $A_h$ with large eigenvalues will also represent highly oscillatory functions.

The theory of diffusion and heat equations can be understood in terms of the eigenvalues $\lambda_j$ and eigenfunctions $\psi_j$ of $-\nabla^2$ on $\Omega$ with homogeneous Dirichlet boundary conditions:

\[
-\nabla^2 \psi_j = \lambda_j \psi_j \quad \text{in } \Omega , \qquad \psi_j(x) = 0 \quad \text{for } x \in \partial\Omega .
\]

We order the eigenvalues $0 < \lambda_1 \le \lambda_2 \le \cdots$. If we write the solution as $u(t, x) = \sum_{j=1}^{\infty} c_j(t)\, \psi_j(x)$ then, with $f = 0$, the coefficients $c_j(t)$ are solutions of the differential equations $dc_j/dt = -\lambda_j c_j$; that is, $c_j(t) = c_j(0) \exp(-\lambda_j t)$. Thus, the coefficients $c_j(t) \to 0$ as $t \to +\infty$. Furthermore, the exponential decay is much more rapid for large $j$, and high frequency components are rapidly damped out.
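
The decay of the lowest mode can be reproduced with a few lines of code. The following sketch applies the method of lines to $\partial u/\partial t = \partial^2 u/\partial x^2$ on $(0,1)$ with homogeneous Dirichlet conditions and $f = 0$, assuming a uniform mesh, piecewise linear elements (so that $A_h = h^{-1}\operatorname{tridiag}(-1, 2, -1)$ and $M_h = (h/6)\operatorname{tridiag}(1, 4, 1)$), and the implicit Euler method in time. The helper name `p1_matrices` and the parameter values are choices made for this example.

```python
import numpy as np

def p1_matrices(N):
    """Stiffness and mass matrices for P1 elements on a uniform mesh of (0, 1), interior nodes only."""
    h = 1.0 / N
    n = N - 1
    I = np.eye(n)
    off = np.eye(n, k=1) + np.eye(n, k=-1)
    A = (2.0 * I - off) / h
    M = (h / 6.0) * (4.0 * I + off)
    return A, M

N, dt, n_steps = 50, 1.0e-3, 100
A, M = p1_matrices(N)
x = np.linspace(0.0, 1.0, N + 1)[1:-1]     # interior nodes
u = np.sin(np.pi * x)                       # lowest eigenfunction of -d^2/dx^2

# Implicit Euler for M du/dt = -A u:  (M + dt A) u_{k+1} = M u_k.
lhs = M + dt * A
for _ in range(n_steps):
    u = np.linalg.solve(lhs, M @ u)

amplitude = np.max(np.abs(u))
expected = np.exp(-np.pi**2 * dt * n_steps)  # exact decay factor exp(-lambda_1 t)
print(f"computed amplitude {amplitude:.4f}, exact decay factor {expected:.4f}")
```

The small discrepancy between the two numbers comes from both the spatial and the temporal discretization; it shrinks as $h$ and $\Delta t$ are reduced.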


The wave equation, on the other hand, has $u(t, x) = \sum_{j=1}^{\infty} c_j(t)\, \psi_j(x)$ where $d^2 c_j/dt^2 = -\lambda_j c_j$ and so $c_j(t) = a_j \cos(\lambda_j^{1/2} t) + b_j \sin(\lambda_j^{1/2} t)$ for suitable constants $a_j$ and $b_j$. Thus the solutions do not become smoother with time in general. In one spatial dimension, the wave equation becomes $\partial^2 u/\partial t^2 = \partial^2 u/\partial x^2$, which has solutions of the form $u_+(x - t) + u_-(x + t)$ for any functions $u_+$ and $u_-$. Apart from momentary cancellation, any roughness in either $u_-$ or $u_+$ persists for all time. In more than one dimension it is possible to have focusing solutions formed by, for example, a radially symmetric wave that arrives at a single point at an instant. For the wave equation $\partial^2 u/\partial t^2 = \nabla^2 u$, the energy $\int_\Omega \tfrac12 [\, (\partial u/\partial t)^2 + \|\nabla u\|_2^2\, ]\, dx$ is constant provided either Dirichlet or Neumann boundary conditions hold.


6.4.1.1 Stability Issues 


The numerical discretization in time should be appropriate for the type of equation. The fact that some eigenvalues of $A_h$ are large means the differential equations we obtain are stiff, and that either

(a): the step size in time should be limited according to the maximum eigenvalue of $A_h$, or
(b): implicit methods for the time stepping should be used.


In case (a), the step size is limited due to stability considerations. To see how this works, consider the one-spatial-dimension diffusion equation $\partial u/\partial t = \partial^2 u/\partial x^2$ and wave equation $\partial^2 u/\partial t^2 = \partial^2 u/\partial x^2$ with Dirichlet boundary conditions $u(0) = u(1) = 0$, discretized using equally spaced piecewise linear elements. In this case,

\[
A_h = N \begin{bmatrix} 2 & -1 & & \\ -1 & 2 & -1 & \\ & \ddots & \ddots & \ddots \\ & & -1 & 2 \end{bmatrix} , \qquad
M_h = \frac{1}{6N} \begin{bmatrix} 4 & 1 & & \\ 1 & 4 & 1 & \\ & \ddots & \ddots & \ddots \\ & & 1 & 4 \end{bmatrix} ,
\]

both $(N - 1) \times (N - 1)$, with $h = 1/N$. The matrices $A_h$ and $M_h$ have common eigenvectors $v_j$ with $(v_j)_\ell = \sin(\pi j \ell/N)$ and eigenvalues $\lambda_j = N (2 - 2\cos(\pi j/N))$ for $A_h$ and $\mu_j = (4 + 2\cos(\pi j/N))/(6N)$ for $M_h$ with $j = 1, 2, \ldots, N - 1$. The eigenvalues of $M_h^{-1} A_h$ are then $\lambda_j/\mu_j = 6 N^2 (1 - \cos(\pi j/N))/(2 + \cos(\pi j/N))$. For small $j$ these are close to $\pi^2 j^2$, the eigenvalues of $-d^2/dx^2$.

If we use the explicit Euler method (6.1.8) for the diffusion equation, then

\[
u_{h,k+1} = \big( I - (\Delta t)\, M_h^{-1} A_h \big)\, u_{h,k} + (\Delta t)\, M_h^{-1} f_h(t_k) .
\]

Stability of this method then depends on the eigenvalues of the iteration matrix $I - (\Delta t) M_h^{-1} A_h$, which are

\[
1 - 6 (\Delta t) N^2\, \frac{1 - \cos(\pi j/N)}{2 + \cos(\pi j/N)} .
\]

The worst case is at $j = N - 1$ where the eigenvalue of the iteration matrix is

\[
1 - 6 (\Delta t) N^2\, \frac{1 - \cos(\pi (N - 1)/N)}{2 + \cos(\pi (N - 1)/N)} \approx 1 - 12 (\Delta t) N^2 .
\]

To keep the magnitude of this eigenvalue less than one, we need $\Delta t < 1/(6 N^2)$, which can be a strong limit on the time step.

If we use a different method for the diffusion equation, we need $-12 (\Delta t) N^2$ to be inside or very close to the stability region for the method. If $N$ is large, as we need for accurate spatial approximation, then for any explicit method, we need $\Delta t = O(N^{-2})$. In general, if we use an explicit time-stepping method and a finite element method based on a triangulation $\mathcal{T}_h$, we need $\Delta t = O(h_{\min}^2)$ where $h_{\min} = \min_{K \in \mathcal{T}_h} h_K$ is the diameter of the smallest element in the triangulation.
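
The step size restriction for the explicit Euler method can be checked numerically without relying on closed-form eigenvalues. The sketch below assumes the one-dimensional model problem with piecewise linear elements on a uniform mesh, taking $A_h = h^{-1}\operatorname{tridiag}(-1, 2, -1)$ and the consistent mass matrix $M_h = (h/6)\operatorname{tridiag}(1, 4, 1)$; it computes the largest eigenvalue of $M_h^{-1} A_h$ and the spectral radius of the iteration matrix just below and just above the limit $\Delta t = 2/\lambda_{\max}$. The names used are illustrative.

```python
import numpy as np

def p1_matrices(N):
    """Stiffness and mass matrices for P1 elements on a uniform mesh of (0, 1), interior nodes only."""
    h = 1.0 / N
    n = N - 1
    I = np.eye(n)
    off = np.eye(n, k=1) + np.eye(n, k=-1)
    return (2.0 * I - off) / h, (h / 6.0) * (4.0 * I + off)

N = 32
A, M = p1_matrices(N)
MinvA = np.linalg.solve(M, A)

# Largest eigenvalue of M^{-1} A (real, since A and M are commuting symmetric matrices).
lam_max = np.linalg.eigvals(MinvA).real.max()
dt_limit = 2.0 / lam_max          # explicit Euler needs |1 - dt * lambda| <= 1

def spectral_radius(dt):
    """Spectral radius of the explicit Euler iteration matrix I - dt M^{-1} A."""
    G = np.eye(N - 1) - dt * MinvA
    return np.max(np.abs(np.linalg.eigvals(G)))

rho_stable = spectral_radius(0.99 * dt_limit)
rho_unstable = spectral_radius(1.01 * dt_limit)
print(f"lambda_max = {lam_max:.1f}, dt limit = {dt_limit:.3e}")
print(f"spectral radius: {rho_stable:.4f} (below limit), {rho_unstable:.4f} (above limit)")
```

Repeating this for several values of $N$ shows the step size limit shrinking in proportion to $N^{-2}$.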

While the high frequency components of the solution are forcing a small time step for the sake of stability, our interest is usually in the lower frequency components of the solution which are accurately resolved spatially. The rate at which the lowest frequency component is damped in our one-spatial-dimension problem is $\approx \exp(-\pi^2 t)$ and does not depend significantly on $N$.

Discretizing the wave equation gives

\[
M_h \frac{d^2 u_h}{dt^2} = -A_h u_h + f_h(t) .
\]

Writing $v_h = du_h/dt$ we have the first order system

\[
\frac{d}{dt} \begin{bmatrix} u_h \\ v_h \end{bmatrix} = \begin{bmatrix} 0 & I \\ -M_h^{-1} A_h & 0 \end{bmatrix} \begin{bmatrix} u_h \\ v_h \end{bmatrix} + \begin{bmatrix} 0 \\ M_h^{-1} f_h(t) \end{bmatrix} .
\]

Applying the explicit Euler method gives

\[
\begin{aligned}
u_{h,k+1} &= u_{h,k} + (\Delta t)\, v_{h,k} , \\
v_{h,k+1} &= v_{h,k} - (\Delta t)\, M_h^{-1} A_h u_{h,k} + (\Delta t)\, M_h^{-1} f_h(t_k) .
\end{aligned}
\]

The eigenvalues of the iteration matrix

\[
\begin{bmatrix} I & (\Delta t) I \\ -(\Delta t) M_h^{-1} A_h & I \end{bmatrix}
\]

are the complex numbers $1 \pm i\, (\Delta t) \sqrt{\lambda_j/\mu_j}$; thus the explicit Euler method is always unstable for the wave equation just as it is for simple harmonic motion $d^2 u/dt^2 = -\omega^2 u$. We can, however, note that for $j = N - 1$, perturbations in the direction of the corresponding eigenvector are amplified by a factor of $\sqrt{1 + (\Delta t)^2 \lambda_{N-1}/\mu_{N-1}}$ with each iteration, where $\lambda_{N-1}/\mu_{N-1}$ is proportional to $N^2$. Even if the method chosen is an explicit method whose stability region includes some of the imaginary axis, such as the original 4th order method of Runge and Kutta (6.1.19), we find that stability requires that $(\Delta t) N$ is bounded. This property is related to, although not identical with, the Courant-Friedrichs-Lewy (CFL) condition [60] for applying explicit numerical methods to the equation $\partial u/\partial t + a\, \partial u/\partial x = f(t, x)$.

One conclusion of this analysis, for numerical methods applied to wave equations, is that implicit methods should be used. Furthermore, the stability region of the method and its boundary should include the imaginary axis. For example, the implicit Euler method, the implicit trapezoidal rule, the Gauss methods, and the Radau IIA methods are all acceptable for solving the discretized wave equation. However, the implicit Euler method and the Radau IIA methods are all dissipative and smooth out the waves, while the implicit trapezoidal method and the Gauss Runge-Kutta methods preserve the energy $\int_\Omega \tfrac12 [\, v_h^2 + \|\nabla u_h\|_2^2\, ]\, dx$.
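
The contrast between the explicit Euler method and the implicit trapezoidal rule can be seen in the discrete energy $\tfrac12 (v_h^T M_h v_h + u_h^T A_h u_h)$. The following sketch assumes the one-dimensional wave equation with $f = 0$, piecewise linear elements on a uniform mesh with consistent mass matrix, and initial data $u(0, x) = \sin(\pi x)$, $v(0, x) = 0$; all parameter values and names are choices made for this illustration.

```python
import numpy as np

N, dt, n_steps = 20, 0.01, 200
h = 1.0 / N
n = N - 1
I = np.eye(n)
off = np.eye(n, k=1) + np.eye(n, k=-1)
A = (2.0 * I - off) / h               # stiffness matrix
M = (h / 6.0) * (4.0 * I + off)       # consistent mass matrix

# First order system y' = J y with y = [u; v].
Z = np.zeros((n, n))
J = np.block([[Z, I], [-np.linalg.solve(M, A), Z]])
I2 = np.eye(2 * n)

euler_step = I2 + dt * J                                          # explicit Euler
trap_step = np.linalg.solve(I2 - 0.5 * dt * J, I2 + 0.5 * dt * J)  # implicit trapezoidal rule

def energy(y):
    """Discrete energy (1/2)(v^T M v + u^T A u)."""
    u, v = y[:n], y[n:]
    return 0.5 * (v @ M @ v + u @ A @ u)

x = np.linspace(0.0, 1.0, N + 1)[1:-1]
y0 = np.concatenate([np.sin(np.pi * x), np.zeros(n)])
y_euler = y0.copy()
y_trap = y0.copy()
for _ in range(n_steps):
    y_euler = euler_step @ y_euler
    y_trap = trap_step @ y_trap

E0, E_euler, E_trap = energy(y0), energy(y_euler), energy(y_trap)
print(f"initial energy {E0:.6f}, explicit Euler {E_euler:.6f}, trapezoidal {E_trap:.6f}")
```

In this sketch the trapezoidal energy agrees with the initial energy up to rounding error, while the explicit Euler energy grows with every step.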


6.4.1.2 Numerical Dispersion 


Even if a stable numerical method is used for the wave equation, there are a number of 
errors that arise in numerical solutions that users should be aware of. One of these is 
dispersion, where different frequency components have different wave speeds. Exact 
solutions of the wave equation have no dispersion. Dispersion occurs physically 
for light passing through water or glass, where different colors are refracted into 
different angles because of the slightly different wave speeds of light of different 
colors. Other systems where dispersion occurs naturally include cables. This was 
very important for long-distance telegraphy in the nineteenth century. The telegraph 
equations developed by Oliver Heaviside [121, vol. 2, p. 52] model the voltage and current along a transmission line:

\[
\begin{aligned}
\frac{\partial V}{\partial x}(t, x) + L\, \frac{\partial I}{\partial t}(t, x) &= -R\, I(t, x) , \\
\frac{\partial I}{\partial x}(t, x) + C\, \frac{\partial V}{\partial t}(t, x) &= -G\, V(t, x) .
\end{aligned}
\]

These equations show dispersion as well as damping if $L G \ne R C$; in this case, sinusoidal waves of different frequencies travel along the transmission line with different speeds. The consequence is that the shape of the signal changes as it travels along the transmission line. As a result, what was originally sent as a clear on/off by a telegraph operator at one end of the cable becomes a smeared out oscillation at the receiving end.

Even though the wave equation itself has no dispersion, numerical methods for the wave equation do have dispersion. As an example, Figure 6.4.1 shows snapshots of the results of the implicit trapezoidal method using the finite element method spatially for the wave equation $\partial^2 u/\partial t^2 = \partial^2 u/\partial x^2$ with Dirichlet boundary conditions $u(0) = u(1) = 0$. Piecewise linear elements with $N + 1$ equally spaced nodes $x_j = j h$, $j = 0, 1, 2, \ldots, N$, are used for the finite element method. Specifically, we use $N = 10^2$ and time step $\Delta t = 10^{-3}$. The initial conditions are $u(0, x) = x (1 - x) \exp(-\alpha (x - x_0)^2)$ with $\alpha = 50$, and $v(0, x) = -(d/dx)\, u(0, x)$. The exact solution preserves its shape away from the reflecting ends of the interval $[0, 1]$; each reflection inverts the wave. Because the length of the interval is one, and the wave speed is one, two complete reflections will return the wave to its original shape. In fact, the exact solution is periodic with period two.

[Fig. 6.4.1: Numerical dispersion illustrated for the wave equation solved by the implicit trapezoidal method. Panel (a): profiles for t = 0, 10, 20. Panel (b): profiles for t = 100, 110, 120.]

Figure 6.4.1(a) shows that the wave speed that arises from the numerical method is not exactly one, the wave speed for the exact wave equation $\partial^2 u/\partial t^2 = \partial^2 u/\partial x^2$. Figure 6.4.1(b) shows the higher frequency components are slightly faster than the lower frequency components of the numerically computed wave.
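
The numerical wave speeds can be estimated mode by mode. The sketch below assumes piecewise linear elements on a uniform mesh with the consistent mass matrix $M_h = (h/6)\operatorname{tridiag}(1, 4, 1)$ and the implicit trapezoidal rule, for which a mode with semi-discrete frequency $\omega$ advances in phase by $\theta = 2 \arctan(\omega\, \Delta t/2)$ per step. The values $N = 100$ and $\Delta t = 10^{-3}$ are assumptions made for this example, and the variable names are illustrative.

```python
import numpy as np

N, dt = 100, 1.0e-3
h = 1.0 / N
n = N - 1
I = np.eye(n)
off = np.eye(n, k=1) + np.eye(n, k=-1)
A = (2.0 * I - off) / h               # stiffness matrix
M = (h / 6.0) * (4.0 * I + off)       # consistent mass matrix

# Semi-discrete frequencies: omega_j^2 are the eigenvalues of M^{-1} A, increasing in j.
omega_h = np.sqrt(np.sort(np.linalg.eigvals(np.linalg.solve(M, A)).real))

# Phase advance per trapezoidal step for y' = i*omega*y is 2*arctan(omega*dt/2).
theta = 2.0 * np.arctan(0.5 * dt * omega_h)

wave_number = np.pi * np.arange(1, N)   # exact modes sin(j*pi*x) have frequency j*pi
phase_speed = theta / (dt * wave_number)

for j in (1, 5, 10, 20):
    print(f"mode {j:2d}: numerical phase speed = {phase_speed[j - 1]:.6f}")
```

With these parameters the spatial error dominates the temporal error, so the computed phase speeds exceed one and increase with the mode number, which is consistent with the faster high frequency components seen in the figure.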

Higher order methods in both time and space can reduce the amount of dispersion. 
However, numerical methods of the type we have been discussing, which lead to 
finite difference schemes that are independent of position and time, cannot eliminate 
dispersion. In two or higher dimensions, irregular spatial discretizations can lead 
to other numerical artifacts, such as numerical scattering where a wave passing 
over a region results in multiple waves moving radially from the irregularity. 
Dispersion and scattering cannot be eliminated by conventional numerical methods, 
but users should be aware of the phenomenon, especially for long integration periods. 


Exercises. 


(1) Implement the piecewise linear finite element method for $\partial^2/\partial x^2$ on $\Omega = (0, 1)$ with the boundary conditions $u(t, 0) = u(t, 1) = 0$. Use this to solve $\partial u/\partial t = \partial^2 u/\partial x^2 + f(t, x)$ with these boundary conditions. Do this with spatial grid spacing $h = 1/N$ for $N = 2^j$ with $j = 2, 3, 4, \ldots, 10$. Apply this to the problem where $f(t, x) = 2 e^t (1 - 2x) \sin x + 2 e^t (1 + x - x^2) \cos x + e^x [ (\sin t - \cos t) x^2 + (3 \sin t + \cos t) x ]$. The exact solution is $u(t, x) = x (1 - x) [ e^t \cos x + \sin t\, e^x ]$. Use the implicit Euler method to solve the discretized differential equations with $\Delta t = 2^{-\ell}$, $\ell = 1, 2, \ldots, 10$. Plot the logarithm of the maximum error against $(\log(\Delta t), \log N)$ to form a three-dimensional surface plot.

(2) Show that the matrix exponential $e^B = I + B + (1/2!)B^2 + (1/3!)B^3 + \cdots$ 
converges for any matrix $B$. However, operators like $\nabla^2$ are unbounded; its 
eigenvalues $\lambda_j \to -\infty$ as $j \to \infty$. Good approximations can be found even for 
unbounded operators like this using Padé approximations (see Exercise 10) of 
the exponential function: $e^z \approx r(z) = n(z)/d(z)$ where $n$ and $d$ are polynomials 
and $r^{(k)}(0) = 1$ for $k = 0, 1, 2, \ldots, m$. In order to handle unbounded operators, 
we want $r(z) = n(z)/d(z)$ to be bounded as $z \to -\infty$, so that $\deg n \le \deg d$. 
Then we can approximate $e^{hB} \approx r(hB)$. Show that $e^{hB} - r(hB) = O(h^{m+1})$ as 
$h \to 0$. 

(3) If $r(z) = n(z)/d(z)$ is a Padé approximation to $e^z$ then for $B$ symmetric, show 
that the condition number $\kappa_2(d(hB)) = \max_\lambda |d(h\lambda)| / \min_\lambda |d(h\lambda)|$ where $\lambda$ 
ranges over the eigenvalues of $B$. If $\|hB\|_2$ is large and $\deg d$ is large (say, over 
four), then this condition number can be very large. Instead, it can be very useful 
to factor $d(z)$ into linear or quadratic factors to avoid the problems of solving 
a system of equations with large condition number. The Padé approximation of 
the exponential function with $(\deg n, \deg d) = (2, 4)$ is 

    $e^z \approx \dfrac{1 + \frac{1}{3}z + \frac{1}{30}z^2}{1 - \frac{2}{3}z + \frac{1}{5}z^2 - \frac{1}{30}z^3 + \frac{1}{360}z^4} = \dfrac{n(z)}{d(z)}$. 

Factor $d(z) = d_1(z)\, d_2(z)$ into quadratics. Assuming $-B$ is positive semi-definite 
(so that $\lambda \le 0$ for any eigenvalue $\lambda$ of $B$), bound $\kappa_2(d_1(hB))$ and 
$\kappa_2(d_2(hB))$ in terms of $h\,\|B\|_2$. 

(4) Consider the system of differential equations 

    $M \dfrac{du}{dt} = -A u + f(t)$ 

with $M$ and $A$ symmetric and positive definite. Show that 

    $\dfrac{d}{dt}\left(\tfrac{1}{2} u^T M u\right) = -u^T A u + u^T f(t) \le u^T f(t)$. 

If $M$ is the mass matrix and $A$ the stiffness matrix for a partial differential 
equation, interpret this to show that 

    $\tfrac{1}{2}\int_\Omega u_h(t, x)^2\,dx \le \tfrac{1}{2}\int_\Omega u_h(0, x)^2\,dx + \int_0^t \int_\Omega u_h(s, x)\, f(s, x)\,dx\,ds$. 

(5) The sine-Gordon equation is the differential equation 

    $\dfrac{\partial^2 u}{\partial t^2} = \dfrac{\partial^2 u}{\partial x^2} - \sin u$. 

Solve this numerically using the finite element method spatially with piecewise 
linear elements and the implicit trapezoidal rule in time. Use periodic boundary 
conditions $u(t, L) = u(t, -L)$ and $\partial u/\partial x(t, L) = \partial u/\partial x(t, -L)$ with $L = 5$, 
and spatial grid spacing $h = 1/N$ with $N = 100$. The sine-Gordon equation is 
notable in that it has wave-like solutions called solitons of the form 
$u_{\mathrm{soliton}}(t, x) = 4\tan^{-1}(\exp(\gamma(x - x_0 - vt)))$ where $\gamma^2(1 - v^2) = 1$. Apply your 
numerical method to the initial conditions 


    $u(0, x) = 4\left[\tan^{-1}(\exp(\gamma(x - x_0))) - \tan^{-1}(\exp(-\gamma(x + x_0)))\right]$, 
Diba a exp(y(x — x0)) exp(—7(x + x0) 
—(0, x) = 4v ; 
Ot 1+ exp(y(x — x9))? 1 + exp(—y(x + x0))? 


with $x_0 = 2$, $v = 1/2$, and $\gamma = 2/\sqrt{3}$. Compare your numerical results 
to the exact soliton solution. [Note: To solve the implicit trapezoidal 
equations for the updated velocity vz; in terms of the previous pair 
(ux, vg) can be reduced to (My, + 4(At)?Ap)ve41 = (My, — G(At)? An) dE — 
(At) Anu, — (At)p(ug + rAt(v, + vx41)) where y(z); = sin(z;). The fixed 
point iteration (Mj, + 1(Ar)?Aj) vei)? = (Mn — L(At)?An) og — (At) Ante — 
(At)p(u, + 5 At (v, + Be ys £=0,1,2,... converges rapidly for small Ar, 
and only requires Mj; + 5 (At)? An to be factored once. ] 

(6) The shallow water equations in two dimensions are an approximation to the 
Navier-Stokes equations for fluid mechanics where an inviscid fluid moves 
over a two-dimensional surface with varying depth $H(x)$, creating waves of 
height $h(t, x)$ with $|h(t, x)| \ll H(x)$, and where the depth is much smaller than the 
horizontal scale. These equations, ignoring Coriolis effects due to the rotation 
of the Earth, can be written as 


One 8, Oh 3 
pa Ce OOP over Q CR’, 
Oh 

am =0 on OQ. 


Here $g$ is the gravitational acceleration, $b$ is a viscous damping parameter, and 
$\partial h/\partial n$ is the normal derivative of $h$ on the boundary of $\Omega$. Implement this 
using the finite element method with piecewise linear elements for the spatial 
discretization and the implicit trapezoidal rule in time. Test your method on 
$\Omega = \{ (x, y) \mid y \ge x^2 \}$ with $h(0, x) = \exp(-\alpha \|x - e_2\|_2^2)$ and $\partial h/\partial t\,(0, x) = 0$ 
with $\alpha = 100$. 


(7) The time-dependent Schrödinger equation of quantum mechanics has the form 

    $i\hbar \dfrac{\partial \psi}{\partial t} = -\dfrac{\hbar^2}{2m} \nabla^2 \psi + V(x)\,\psi$ 

where $\hbar = h/(2\pi)$ is the reduced Planck constant, $m$ the mass of the particle, 
and $V(x)$ the potential energy function. If we create a spatial discretization 
$\psi(t, x) = \sum_\ell \psi_\ell(t)\,\phi_\ell(x)$ and we apply the Galerkin approximation, show 
that the resulting system of ordinary differential equations is 

    $i\hbar M \dfrac{d\boldsymbol{\psi}}{dt} = \dfrac{\hbar^2}{2m} A \boldsymbol{\psi} + \widetilde{V} \boldsymbol{\psi}$ 


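where $m_{k\ell} = \int_\Omega \phi_k\,\phi_\ell\,dx$, $a_{k\ell} = \int_\Omega \nabla\phi_k \cdot \nabla\phi_\ell\,dx$, and 
$\tilde{v}_{k\ell} = \int_\Omega V(x)\,\phi_k(x)\,\phi_\ell(x)\,dx$. Show that, provided that for each basis function 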
de we have teas or dx finite, each of M, A, and V are symmetric while M 
and $A$ are also positive definite. What are the possible exponents $\alpha$ where 
$\psi(t, x) = \exp(\alpha t)\,\phi(x)$ is a solution of the Schrödinger equation for some $\phi$? 
Given that the eigenvalues of M~! A are typically large, what regions should be 
contained in the stability region of the ODE method used? 


Projects 


(1) Write code to find geodesics on a surface given by $S = \{x \mid g(x) = 0\} \subset \mathbb{R}^n$ 
with $\nabla g(x) \ne 0$ for every $x \in S$. Do this by solving the geodesic equation 

    $\dfrac{d^2 x}{dt^2} = -\lambda\,\nabla g(x), \qquad \lambda = \dfrac{(dx/dt)^T\, \mathrm{Hess}\, g(x)\, (dx/dt)}{\|\nabla g(x)\|_2^2}$, 

as a two-point boundary value problem with boundary conditions $x(a) = x_0$ and 
$x(b) = x_1$. We can solve this using a shooting method to find $v_0 := dx/dt(a)$. 
The constraint that $g(x(t)) = 0$ for all $t$ means that $\nabla g(x_0)^T v_0 = 0$. Use the 
variational equations for the geodesic equations (which involve third derivatives 
of $g$!) to create the Newton equations. Since the Newton equations with the 
condition $\nabla g(x_0)^T v_0 = 0$ are over-determined, use the least squares solution (that 
is, use the pseudo-inverse (2.5.10) instead of the inverse). Apply this to finding 
geodesics on ellipsoids $(x/a)^2 + (y/b)^2 + (z/c)^2 = 1$ with $a \ne b \ne c$. 
(2) Implement a basic finite element system based on piecewise linear elements over 
a triangulation in two dimensions. The triangulation $\mathcal{T}$ is represented by a pair 
of arrays: the vertices of the triangulation are given by an array $p[i, j] = (p_j)_i$ 
(the $i$th coordinate of the $j$th point) and $j = t[r, s]$ being the index in $p$ of the 
$s$th vertex of the $r$th triangle. Thus, $p$ is a $2 \times n$ array of real numbers and $t$ 
is an $m \times 3$ array of integers where $n$ is the number of points and $m$ is the 
number of triangles. Assume that the region $\Omega$ is the union of the triangles in 
the triangulation. Write code to compute 

    $a_{k\ell} = \int_\Omega \left[\alpha(x)\,\nabla\phi_k(x)^T \nabla\phi_\ell(x) + \beta(x)^T \nabla\phi_k(x)\,\phi_\ell(x) + \gamma(x)\,\phi_k(x)\,\phi_\ell(x)\right] dx$, 

    $f_\ell = \int_\Omega f(x)\,\phi_\ell(x)\,dx$ 

for $k, \ell = 1, 2, \ldots, n$, given functions $\alpha$, $\beta$, $\gamma$, and $f$. Here $\phi_k$ is the piecewise 
linear function with $\phi_k(p_j) = 1$ if $k = j$ and zero if $k \ne j$. Note that 
$\int_\Omega = \sum_K \int_K$ where $K$ ranges over triangles in the triangulation. For triangle $K$ with 
vertices $p_j$, $p_k$, $p_\ell$, $\phi_r$ is zero on $K$ provided $r \ne j$, $k$ or $\ell$. Check that $a_{k\ell} = 0$ 
if $p_k$ and $p_\ell$ are not both vertices of the same triangle $K$. Thus $A$ is a sparse 
matrix. It can be constructed by summing over triangles: 


    A ← 0; f ← 0 
    for K ∈ T 
        J ← vertices(K)    (set of vertex indices) 
        for k, ℓ ∈ J 
            a_{kℓ} ← a_{kℓ} + ∫_K [α(x) ∇φ_k(x)^T ∇φ_ℓ(x) + ···] dx 
        end for 
        for ℓ ∈ J 
            f_ℓ ← f_ℓ + ∫_K f(x) φ_ℓ(x) dx 
        end for 
    end for 


The integrals can be computed by a numerical integration method of your choice, 
preferably one that is exact for quadratic functions at least. 

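Boundary vertices and boundary edges can be determined as follows: each 
boundary edge is an edge of exactly one triangle; a boundary vertex is an 
endpoint of at least one boundary edge. Normally, every boundary vertex is 
the endpoint of exactly two boundary edges. We set $u_{\partial\Omega} = [u_\ell \mid x_\ell \in \partial\Omega]$, 
and $u_{\mathrm{int}} = [u_\ell \mid x_\ell \notin \partial\Omega]$. Similarly we let $A_{\mathrm{int},\mathrm{int}} = [a_{k\ell} \mid x_k, x_\ell \notin \partial\Omega]$, 
$A_{\mathrm{int},\partial\Omega} = [a_{k\ell} \mid x_k \notin \partial\Omega,\ x_\ell \in \partial\Omega]$, etc. 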

For setting Dirichlet boundary conditions, such as $u(x, y) = g(x, y)$ for $(x, y) \in \partial\Omega$ 
where $g$ is not piecewise linear, we can set up a least squares approximation 
as follows: create a “boundary mass matrix” $m_{k\ell} = \int_{\partial\Omega} \phi_k(x)\,\phi_\ell(x)\,dS(x)$ and 
the vector $b_\ell = \int_{\partial\Omega} \phi_\ell(x)\,g(x)\,dS(x)$. Then the boundary values $u_k$, where $x_k$ is a 
boundary node, can be computed by solving $M u_{\partial\Omega} = b$. The partial differential 
equation with Dirichlet boundary conditions can then be solved approximately 
by solving $A_{\mathrm{int},\mathrm{int}}\, u_{\mathrm{int}} = f_{\mathrm{int}} - A_{\mathrm{int},\partial\Omega}\, u_{\partial\Omega}$ where $u_{\partial\Omega}$ has already been 
computed. 

If you are feeling ambitious, extend your code to also handle quadratic Lagrange 
elements (see Figure 4.3.6(a)). 


Chapter 7 
Randomness 


7.1 Probabilities and Expectations 


Probabilities have been used since the ninth century by Arab scholars studying cryp- 
tography [34]. Games of chance were the motivation for Europeans to start studying 
probabilities with the work of Gerolamo Cardano [48], Pierre de Fermat, and Blaise 
Pascal in the seventeenth century [169, 218]. The Dutch physicist Christiaan Huy- 
gens was also in this tradition of studying games of chance, writing a textbook about 
it [130] (1657). Jakob Bernoulli (1654-1705) and Abraham de Moivre (1667-1754) 
also contributed to the development of the theory. In the nineteenth century, Pierre 
de Laplace (1749-1827) contributed greatly to the theory with his work Théorie 
Analytique des Probabilités (1812) [154]. From the late nineteenth century to the 
twentieth century, contributors include Chebyshev, Markov, and Kolmogorov. It was 
Kolmogorov who created the modern theory of probability with his 1933 monograph 
Foundations of Probability Theory [147]. We start with Kolmogorov’s formalism. 


7.1.1 Random Events and Random Variables 


Kolmogorov’s insight was that we need to start with a space of “all things that can 
possibly happen” called an event space $\Omega$. If we were dealing with Monopoly games, 
then this space would include all possible Monopoly games with details down to the 
level of the individual dice rolls. In general, this event space may be discrete or 
continuous, or some combination of the two. 

With an event space $\Omega$, we do not necessarily assign a probability to each 
individual $\omega \in \Omega$; for continuous quantities, the probability of a single specific $\omega \in \Omega$ 
would be zero. (Example: What is the probability that a number, chosen at random 
between zero and one, is exactly 1/2? If we start writing out a decimal expansion 
of a random number, the probability of getting “5” followed by infinitely many 
“0”s is zero.) Instead Kolmogorov leaned on the theory of Lebesgue integration and 
measure theory: there must be a collection $\mathcal{B}$ of “measurable” subsets of $\Omega$ that 
forms a $\sigma$-algebra of sets. That is, 


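• $\emptyset \in \mathcal{B}$, 
• $B_1, B_2, \ldots \in \mathcal{B}$ implies $\bigcap_{i=1}^{\infty} B_i \in \mathcal{B}$ and $\bigcup_{i=1}^{\infty} B_i \in \mathcal{B}$, and 
• $B \in \mathcal{B}$ implies $\Omega \setminus B \in \mathcal{B}$. 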


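Every $B \in \mathcal{B}$ is a subset of $\Omega$ called an event. Every event is assigned a probability 
$\Pr[B]$, and $\Pr$ is a probability measure on $\Omega$. That is, 

    $\Pr\left(\bigcup_{i=1}^{\infty} B_i\right) = \sum_{i=1}^{\infty} \Pr(B_i)$   provided $B_i \cap B_j = \emptyset$ for $i \ne j$, 

    $\Pr(\Omega) = 1$, 
    $\Pr(B) \ge 0$   for all $B \in \mathcal{B}$. 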


A consequence of the properties listed for a probability measure is that $\Pr(\emptyset) = 0$. 

A random variable $X$ with values in a measurable set $A$ is defined as a measurable 
function $X\colon \Omega \to A$; that the function is measurable means that for every measurable 
set $F \subseteq A$, the set $\{\omega \mid X(\omega) \in F\}$ belongs to $\mathcal{B}$, and so is measurable in $\Omega$. The event 
“$X(\omega) \in F$” is often simply written “$X \in F$”, which has the probability 

    $\Pr(\{\omega \mid X(\omega) \in F\}) = \Pr(X \in F)$. 

Two random variables $X\colon \Omega \to A$ and $Y\colon \Omega \to B$ are independent if for any 
measurable $E \subseteq A$ and $F \subseteq B$, 

(7.1.1)    $\Pr(X \in E \;\&\; Y \in F) = \Pr(X \in E) \cdot \Pr(Y \in F)$. 


Two consecutive dice rolls are considered independent as the outcome of one (appar- 
ently) does not affect the outcome of the other. Random variables that are related are 
not independent, but we can talk about the conditional probability of one event given 
another. Consider the outcome of getting an “A” on Calculus I and getting an “A” on 
Calculus II. While getting an “A” on one does not guarantee getting it on the other, 
it certainly makes it more likely. To measure this, we use conditional probabilities: 


(7.1.2)    $\Pr(X \in E \mid Y \in F) = \dfrac{\Pr(X \in E \;\&\; Y \in F)}{\Pr(Y \in F)}$. 


Provided $\Pr(Y \in F) > 0$ this gives probabilities “assuming that $Y \in F$”. A little care 
must be taken with this concept in the case of continuous random variables: if $Y$ is, for 
example, uniformly distributed on an interval $[a, b]$ with $a < b$, then $\Pr(Y = c) = 0$ 
for any $c \in [a, b]$. But we can use $\Pr(Y \in [c, c + \delta])$ for $\delta > 0$ and define 

(7.1.3)    $\Pr(X \in E \mid Y = c) = \lim_{\delta \downarrow 0} \dfrac{\Pr(X \in E \;\&\; Y \in [c, c + \delta])}{\Pr(Y \in [c, c + \delta])}$. 


The probability distribution of $X$ is the probability measure $\pi_X$ given by 

(7.1.4)    $\pi_X(F) = \Pr(X \in F)$. 

Note that $\pi_X$ is a probability measure over the set of possible values $A = \mathrm{range}(X) = 
\{ X(\omega) \mid \omega \in \Omega \}$ of $X$, not over $\Omega$. If $\mathrm{range}(X) = \mathbb{R}$, then we have the possibility 
that $\pi_X$ might be represented by a probability density function or pdf $p_X$: 

(7.1.5)    $\pi_X(F) = \int_F p_X(x)\,dx$. 


The probability distribution of a random variable may be a sum of Dirac 6-functions 
for a discrete probability distribution, or represented by a probability density function 
for a continuous probability distribution. 

We write $X \sim \pi$ where $\pi$ is a probability measure to mean 

    $\Pr(X \in E) = \pi(E)$   for all $E$ measurable in $\mathrm{range}(X)$. 

We then say that $\pi$ is the probability distribution of $X$. 
Examples of probability distributions include: 


• Bernoulli distribution: $\mathrm{range}(X) = \{0, 1\}$ and if $p = \Pr(X = 1)$ then $1 - p = 
\Pr(X = 0)$. The probability measure of $X$ is $\pi_X(\{0\}) = 1 - p$ and $\pi_X(\{1\}) = p$. 
If $X$ is the result of a fair coin toss (heads for one, tails for zero) then the probability 
distribution is a Bernoulli distribution with $p = 1/2$. 

We write $X \sim \mathrm{Bernoulli}(p)$. 

• Binomial distribution: $\mathrm{range}(X) = \{0, 1, 2, \ldots, n\}$ with 
$\Pr(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$. This is the sum of $n$ independent random variables $X_i$, 
$i = 1, 2, \ldots, n$, each of which has the Bernoulli probability distribution with 
$\Pr(X_i = 1) = p$. 

We write $X \sim \mathrm{Binom}(n, p)$. 

• Poisson distribution: $\mathrm{range}(X) = \{0, 1, 2, 3, \ldots\}$ with $\Pr(X = k) = e^{-\lambda} (\lambda^k / k!)$ 
for a given parameter $\lambda > 0$. This corresponds to the limit as $n \to \infty$ of a sum of 
$n$ independent Bernoulli random variables $X_i$, each with $\Pr(X_i = 1) = \lambda/n$. 

We write $X \sim \mathrm{Poisson}(\lambda)$. 

• Uniform distribution: $\mathrm{range}(X) = [a, b]$ with constant probability density 
function $p_X(x) = 1/(b - a)$ for $a \le x \le b$. 

We write $X \sim \mathrm{Uniform}(a, b)$. 
• Exponential distribution: $\mathrm{range}(X) = [0, \infty)$ with probability density function 
$p_X(x) = \alpha e^{-\alpha x}$ for $x \ge 0$. This describes the waiting time for a repeatable event 
to occur the first time, where the event could occur equally likely at any moment. 

We write $X \sim \mathrm{Exponential}(\alpha)$. 

• Normal (or Gaussian) distribution: $\mathrm{range}(X) = \mathbb{R}$ with probability density 
function $p_X(x) = (2\pi)^{-1/2} \sigma^{-1} \exp(-(x - \mu)^2/(2\sigma^2))$. This distribution is famous 
as the “bell-shaped curve” and is important in many applications because of the 
Central Limit Theorem (Theorem 7.4). 

We write $X \sim \mathrm{Normal}(\mu, \sigma^2)$. 

The multivariate normal distribution over $\mathbb{R}^d$ has probability density function 
$p_X(x) = (2\pi)^{-d/2} (\det V)^{-1/2} \exp(-\tfrac{1}{2}(x - \mu)^T V^{-1} (x - \mu))$ where $\mu = E[X]$ 
and $V$ is the variance-covariance matrix: $V = E[(X - \mu)(X - \mu)^T]$. 

We write $X \sim \mathrm{Normal}(\mu, V)$. 


There are, of course, many other important probability distributions. This is simply 
a selection of some of the most important ones. 

Note that the conditional probability $\Pr(X \in E \mid Y = y)$ in (7.1.3) can be defined 
implicitly via 

(7.1.6)    $\Pr(X \in E \;\&\; Y \in F) = \int_F \Pr(X \in E \mid Y = y)\,\pi_Y(dy)$, 

where $\pi_Y$ is the probability distribution of $Y$. 
If $X$ is a real-valued random variable, then the cumulative distribution function 
of $X$ is $\mathrm{cdf}_X\colon \mathbb{R} \to [0, 1]$ where 

(7.1.7)    $\mathrm{cdf}_X(s) = \Pr(X \le s)$. 

If $X$ has a probability density function $p_X$ then $\mathrm{cdf}_X(s) = \int_{-\infty}^{s} p_X(x)\,dx$. Note that 
we can recover $p_X$ from $\mathrm{cdf}_X$ as $p_X(s) = \mathrm{cdf}_X'(s)$. Cumulative distribution functions 
can be useful, for example, to find the probability distribution of the maximum of 
two independent random variables: 

    $\mathrm{cdf}_{\max(X,Y)}(s) = \Pr(\max(X, Y) \le s) = \Pr(X \le s \;\&\; Y \le s)$ 
    $\qquad = \Pr(X \le s) \cdot \Pr(Y \le s) = \mathrm{cdf}_X(s) \cdot \mathrm{cdf}_Y(s)$. 
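
The product rule for the cdf of a maximum can be checked numerically. The following is a minimal sketch, assuming X and Y are independent Uniform(0, 1) random variables so that the cdf of the maximum is s squared on [0, 1]; the function names, the sample size and the seed are illustrative choices, not taken from the text.

```python
import random


def cdf_max_uniform(s: float) -> float:
    """cdf of max(X, Y) for independent X, Y ~ Uniform(0, 1)."""
    c = min(max(s, 0.0), 1.0)  # cdf of a single Uniform(0, 1) variable at s
    return c * c               # product of the two individual cdfs


def empirical_cdf_max(s: float, samples: int = 100_000, seed: int = 0) -> float:
    """Fraction of simulated pairs whose maximum is at most s."""
    rng = random.Random(seed)
    hits = sum(max(rng.random(), rng.random()) <= s for _ in range(samples))
    return hits / samples
```

With 100 000 samples the empirical value at s = 0.5 should agree with the exact value 0.25 to roughly two decimal places.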


Random variables can be transformed by functions: $Y = f(X)$ is a new random 
variable obtained from $X$ where $f\colon \mathbb{R} \to \mathbb{R}$ is a deterministic (that is, not random) 
function. Then 

    $\pi_Y(F) = \Pr(Y \in F) = \Pr(f(X) \in F) = \Pr(X \in f^{-1}(F)) = \pi_X(f^{-1}(F))$ 

where $f^{-1}(F) = \{x \mid f(x) \in F\}$. If $f$ is a strictly increasing function, $f^{-1}$ is a function, 
and 

    $\mathrm{cdf}_Y(s) = \Pr(Y \le s) = \Pr(f(X) \le s) = \Pr(X \le f^{-1}(s)) = \mathrm{cdf}_X(f^{-1}(s))$. 

7.1.2 Expectation and Variance 


If $X$ has real values (that is, $X\colon \Omega \to \mathbb{R}$), then the expected value of $X$ is 

(7.1.8)    $E[X] = \int_\Omega X(\omega)\,\Pr(d\omega)$, 

provided $X$ is an integrable function with respect to the $\Pr$ measure. In that case we 
can also express the expected value as the integral 

(7.1.9)    $E[X] = \int_{-\infty}^{+\infty} x\,\pi_X(dx)$, 

and if $X$ has a probability density function, 

(7.1.10)    $E[X] = \int_{-\infty}^{+\infty} x\,p_X(x)\,dx$. 

For a discrete random variable with values $x_1, x_2, \ldots$, 

(7.1.11)    $E[X] = \sum_{i=1}^{\infty} x_i \Pr(X = x_i)$. 

The value $E[X]$ is also called the mean of $X$. 
If we wish to denote the expectation with respect to a given probability distribution 
$\pi$ or probability density function $p$, we use the notation 

    $E_{X \sim \pi}[f(X)]$   or   $E_{X \sim p}[f(X)]$. 


There are random variables for which there is no mean (or for which it is infinite). 
Consider the random variable $X$ which has the value $X = 2^k$ with probability $2^{-k}$ 
for $k = 1, 2, 3, \ldots$. This is a discrete probability distribution, so 

    $E[X] = \sum_{k=1}^{\infty} 2^k \times 2^{-k} = +\infty$. 

It should be noted that expectation is linear, just as integration is linear: 

    $E[\alpha X + \beta Y] = \alpha E[X] + \beta E[Y]$ 

for non-random $\alpha$ and $\beta$. 
The variance of the random variable $X$ is defined as 

    $\mathrm{Var}[X] = E[(X - E[X])^2] = E[X^2] - E[X]^2 \ge 0$. 

The standard deviation of $X$ is 

    $\mathrm{stddev}[X] = \sqrt{\mathrm{Var}[X]}$. 

The standard deviation is useful as it has the same physical units as $X$, and scales 
the same way: $\mathrm{stddev}[\alpha X] = |\alpha|\,\mathrm{stddev}[X]$, while $\mathrm{Var}[\alpha X] = \alpha^2\,\mathrm{Var}[X]$. 

A very important result is that for independent random variables $X$ and $Y$, 
$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$. 


Theorem 7.1 If $X$ and $Y$ are independent random variables for which the expected 
values of both $X^2$ and $Y^2$ are finite, then 

(7.1.12)    $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$. 

Before we prove that variances of sums of independent random variables add, we 
need a lemma that is useful in other circumstances. 

Lemma 7.2 If $X$ and $Y$ are independent random variables for which $E[X \cdot Y]$ is 
defined (for example if $X^2$ and $Y^2$ both have finite expected value), then 

(7.1.13)    $E[X \cdot Y] = E[X] \cdot E[Y]$. 


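Proof We first define the joint probability distribution $\pi_{X,Y}$ via 

    $\pi_{X,Y}(E \times F) = \Pr(X \in E \;\&\; Y \in F)$, 

which is a measure over $\mathbb{R} \times \mathbb{R}$. Then 

    $E[X \cdot Y] = \int_{\mathbb{R} \times \mathbb{R}} x\,y\;\pi_{X,Y}(dx, dy)$. 

Since $|x\,y| \le \tfrac{1}{2}(x^2 + y^2)$, provided $E[X^2]$ and $E[Y^2]$ are both finite, $E[|X \cdot Y|]$ is 
also finite. But if $X$ and $Y$ are independent, 

    $\pi_{X,Y}(E \times F) = \Pr(X \in E \;\&\; Y \in F) = \Pr(X \in E) \cdot \Pr(Y \in F) = \pi_X(E) \cdot \pi_Y(F)$. 

Thus 

    $E[X \cdot Y] = \int_{\mathbb{R} \times \mathbb{R}} x\,y\;\pi_{X,Y}(dx, dy) = \int_{\mathbb{R}} \int_{\mathbb{R}} x\,y\;\pi_X(dx)\,\pi_Y(dy)$ 
    $\qquad = \int_{\mathbb{R}} x\,\pi_X(dx) \int_{\mathbb{R}} y\,\pi_Y(dy) = E[X]\,E[Y]$, 

as we wanted. 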


Now we can proceed with the proof of the main result. 


Proof If $\mathrm{Var}[X]$ and $\mathrm{Var}[Y]$ are both finite and $X$, $Y$ independent, then 

    $\mathrm{Var}[X + Y] = E[(X + Y)^2] - E[X + Y]^2$ 
    $\qquad = E[X^2 + 2XY + Y^2] - E[X]^2 - 2E[X]E[Y] - E[Y]^2$ 
    $\qquad = E[X^2] - E[X]^2 + E[Y^2] - E[Y]^2 + 2E[XY] - 2E[X]E[Y]$ 
    then by Lemma 7.2, 
    $\qquad = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2E[X]E[Y] - 2E[X]E[Y]$ 
    $\qquad = \mathrm{Var}[X] + \mathrm{Var}[Y]$, 

as we wanted. 
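
The additivity of variances for independent random variables can be verified exactly on small discrete distributions. The sketch below assumes a fair six-sided die and a fair coin as the two independent variables (illustrative choices, not from the text) and uses exact rational arithmetic, so no rounding error enters the comparison.

```python
from fractions import Fraction
from itertools import product


def mean(pmf):
    """Expected value of a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    """Var[X] = E[(X - E[X])^2]."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


def pmf_of_sum(pmf_x, pmf_y):
    """Distribution of X + Y for independent X and Y (joint pmf is the product)."""
    out = {}
    for (x, p), (y, q) in product(pmf_x.items(), pmf_y.items()):
        out[x + y] = out.get(x + y, 0) + p * q
    return out


die = {k: Fraction(1, 6) for k in range(1, 7)}   # Var = 35/12
coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}    # Var = 1/4
total = pmf_of_sum(die, coin)
```

Here `variance(total)` equals `variance(die) + variance(coin)`, as (7.1.12) predicts.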


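The variance can be used to bound how much variation or randomness a random 
variable $X$ has around its mean. One such bound is Chebyshev’s inequality [246]: 

Theorem 7.3 If $X$ is a random variable with finite variance, then for $\mu = E[X]$ 
and $\sigma = \mathrm{stddev}[X]$, we have for any $k > 0$, 

(7.1.14)    $\Pr(|X - \mu| \ge k\sigma) \le 1/k^2$. 

Proof Noting that 

    $\sigma^2 = \mathrm{Var}[X] = \int_{\mathbb{R}} (x - \mu)^2\,\pi_X(dx)$, 

dividing by $\sigma^2$ gives 

    $1 = \int_{\mathbb{R}} \left(\dfrac{x - \mu}{\sigma}\right)^2 \pi_X(dx)$ 
    $\quad \ge \int_{(-\infty,\,\mu - k\sigma] \cup [\mu + k\sigma,\,+\infty)} \left(\dfrac{x - \mu}{\sigma}\right)^2 \pi_X(dx)$ 
    $\quad \ge \int_{(-\infty,\,\mu - k\sigma] \cup [\mu + k\sigma,\,+\infty)} k^2\,\pi_X(dx) = k^2 \Pr\left(\dfrac{|X - \mu|}{\sigma} \ge k\right)$. 

Rearranging gives 

    $\Pr(|X - \mu| \ge k\sigma) \le 1/k^2$, 

as we wanted. 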
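
As a small worked check of (7.1.14), the sketch below evaluates both sides of Chebyshev's inequality exactly for a fair six-sided die. The die and the value k = 6/5 are illustrative assumptions; the comparison is done on squares so that no square root of the variance is needed.

```python
from fractions import Fraction


def chebyshev_check(pmf, k):
    """Return (Pr(|X - mu| >= k*sigma), 1/k^2) for a discrete pmf and k > 0."""
    mu = sum(x * p for x, p in pmf.items())
    var = sum((x - mu) ** 2 * p for x, p in pmf.items())
    # |x - mu| >= k*sigma is equivalent to (x - mu)^2 >= k^2 * sigma^2 for k > 0.
    tail = sum(p for x, p in pmf.items() if (x - mu) ** 2 >= k * k * var)
    return tail, 1 / (k * k)


die = {x: Fraction(1, 6) for x in range(1, 7)}
tail, bound = chebyshev_check(die, Fraction(6, 5))
```

For the die, only the outcomes 1 and 6 lie at least 1.2 standard deviations from the mean, so the tail probability is 1/3, well under the bound 25/36. The inequality is rarely tight, which is the price of holding for every distribution with finite variance.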
7.1.3 Averages 


The expected value of a random variable $X$ is given as an integral 

    $E[X] = \int_{\mathbb{R}} x\,\pi_X(dx) = \mu$. 

We can approximate it by the average of independent samples from the same 
distribution, $X_k \sim \pi_X$ for all $k$: 

    $A_n = \dfrac{1}{n}(X_1 + X_2 + \cdots + X_n)$. 

The average $A_n$ is itself a random variable. While 

    $E[A_n] = \dfrac{1}{n}(E[X_1] + E[X_2] + \cdots + E[X_n]) = E[X]$, 

the main question here is: How close is $A_n$ to $\mu$? The Laws of Large Numbers show 
that $A_n$ converges to the mean $\mu$ under mild assumptions: the weak Law of Large 
Numbers shows that, provided $E[|X|]$ is finite, for any $\epsilon > 0$ we have 

(7.1.15)    $\Pr(|A_n - \mu| > \epsilon) \to 0$   as $n \to \infty$. 

The strong Law of Large Numbers states that, assuming that $E[|X|]$ is finite, 

(7.1.16)    $\Pr\left(\lim_{n \to \infty} A_n = \mu\right) = 1$. 

If $\mathrm{Var}[X]$ is finite, then we have more detail about the variation of $A_n$ around the 
mean $\mu$ from the Central Limit Theorem: 

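Theorem 7.4 Provided the random variables $X_i$ are independent and identically 
distributed with finite variance $V$, the probability distribution $\pi_n$ of 

    $S_n = \sqrt{n}\,(A_n - \mu) = \dfrac{1}{\sqrt{n}} \sum_{k=1}^{n} (X_k - \mu)$ 

converges to the normal distribution in the sense that for any bounded continuous 
function $\varphi$, 

    $E[\varphi(S_n)] \to \int_{-\infty}^{+\infty} \varphi(s)\, \dfrac{\exp(-s^2/(2V))}{\sqrt{2\pi V}}\,ds$   as $n \to \infty$. 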
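
The behavior of sample averages can be illustrated with a short simulation. This sketch assumes samples from Uniform(0, 1), for which the mean is 1/2 and the variance is 1/12; the function names, sample size and seed are arbitrary illustrative choices. It computes both the average A_n and the scaled deviation sqrt(n)(A_n - mu) that appears in the Central Limit Theorem.

```python
import math
import random


def sample_average(n: int, seed: int = 0) -> float:
    """A_n: the average of n independent Uniform(0, 1) samples."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n)) / n


def scaled_deviation(n: int, seed: int = 0) -> float:
    """S_n = sqrt(n) * (A_n - mu) with mu = 1/2 for Uniform(0, 1)."""
    return math.sqrt(n) * (sample_average(n, seed) - 0.5)
```

For n = 100 000 the average should lie within about 0.01 of 1/2, while the scaled deviation stays of order sqrt(1/12), roughly 0.29, no matter how large n becomes.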
Exercises. 


(1) Show that if $X_1 \sim \mathrm{Poisson}(\lambda_1)$ and $X_2 \sim \mathrm{Poisson}(\lambda_2)$ are independent, then 
$X_1 + X_2 \sim \mathrm{Poisson}(\lambda_1 + \lambda_2)$. 

(2) Suppose that $X_1, X_2, \ldots, X_n$ are independent random variables, but all are 
distributed according to a cumulative probability function $\Pr[X_i \le x] = F(x)$. 
Show that 

    $\Pr[\max\{X_k \mid 1 \le k \le n\} \le x] = F(x)^n$. 


(3) Suppose that X ~ Uniform(a, b). What are E [X], Var [X] and E Es ? 

(4) For a random variable $X$ with real values the function $\chi_X(t) = E[e^{itX}]$ is 
called the characteristic function of $X$. Show that the characteristic function 
is the Fourier transform of the probability density function of $X$, assuming 
that $X$ has a density function. Show that if $X$ and $Y$ are independent random 
variables, then $\chi_{X+Y}(t) = \chi_X(t)\,\chi_Y(t)$. Also show that $\chi_{aX}(t) = \chi_X(at)$ and 
$\chi_{X+a}(t) = e^{iat}\,\chi_X(t)$. 

(5) Show that $E[X] = -i\,\chi_X'(0)$, and $\mathrm{Var}[X] = -\chi_X''(0) + \chi_X'(0)^2$. 

(6) Given random variables $X$ and $Y$ with real values, show that $X$ and $Y$ are 
independent only if $E[e^{i(sX + tY)}] = \chi_X(s)\,\chi_Y(t)$ for all real $s$ and $t$. [Note: The 
converse is also true, but takes more work.] 

(7) Show that the characteristic function of $aX$ for a random variable $X$ is $\chi_{aX}(t) = 
\chi_X(at)$. Use this to show that if $X_i \sim \mathrm{Uniform}(-1, 1)$ for $i = 1, 2, \ldots, n$ are 
independent, and $S_n = (1/n) \sum_{i=1}^{n} X_i$, then $\chi_{S_n}(t) = [\sin(t/n)/(t/n)]^n$. Using 
$(\sin u)/u = 1 - \tfrac{1}{6}u^2 + O(u^4)$, show that $\chi_{S_n}(t) \to 1$ as $n \to \infty$. Also show 
that $\chi_{\sqrt{n}\,S_n}(t) \to \exp(-t^2/6)$ as $n \to \infty$, showing that $\sqrt{n}\,S_n$ has an 
asymptotically normal distribution as $n \to \infty$. 

(8) Show that if $X \sim \mathrm{Normal}(\mu, \sigma^2)$ then $\chi_X(t) = \exp(i\mu t - \tfrac{1}{2}\sigma^2 t^2)$. [Hint: If 
$Z = (X - \mu)/\sigma$ then $Z \sim \mathrm{Normal}(0, 1)$. Show that $\chi_Z(t) = \exp(-\tfrac{1}{2}t^2)$ and 
use the rules from Exercise 4.] 

(9) Show that if $X$ and $Y$ are independent random variables, then so are $f(X)$ 
and $g(Y)$ for any functions $f$ and $g\colon \mathbb{R} \to \mathbb{R}$. [Hint: To show $\Pr[f(X) \in 
E \;\&\; g(Y) \in F] = \Pr[f(X) \in E]\,\Pr[g(Y) \in F]$ we should use $f(X) \in E$ if 
and only if $X \in f^{-1}(E) := \{z \mid f(z) \in E\}$.] 

(10) Show that if $X$ is a random variable that is never negative, then $\Pr[X \ge a] \le 
E[X]/a$. 




7.2 Pseudo-Random Number Generators 


Anyone who considers arithmetical methods of producing random digits is, of course, in a 
state of sin. 


John von Neumann 


“Think of a random number between one and a hundred.” 
Algorithm 64 Example of pseudo-code for a generator 

    generator mylist(n) 
        while true 
            yield n 
            n ← n + 1 
        end while 
    end generator 

    function print_natural_numbers() 
        for x in mylist(1) 
            print x 
        end for 
    end function 

    /* alternative version */ 
    function print_natural_numbers_alt() 
        g ← mylist(1) 
        while true 
            print(next(g)) 
        end while 
    end function 


Suppose you thought of 37. Is it random? It might appear so, but if we understand 
“random” to mean that all numbers in the range are equally likely, then 1, 10, and 99 
are just as random. The point is that the value of a number does not make it random. 
A number is random if it is generated randomly. 

Lotteries often use a collection of light balls that are tumbled in a rotating drum, and at 
certain times a ball is drawn through a tube into a tray where it can be identified 
more easily. Is this random? Is tossing a coin random? These can be difficult questions 
to answer. There are mechanical coin tossers that are guaranteed to give either heads 
or tails as desired. The motion of the balls is governed by macroscopic physical 
processes, and so should be deterministic. If we knew the initial conditions and the 
motion of the drum precisely, then it should be possible in principle to determine the 
entire state of motion of the balls. However, very small perturbations of these initial 
values can give very different results. When we toss a coin, the variation and 
imprecision in how we hold and toss it gives enough variation in the outcomes 
to make the result unpredictable, which is essentially a random outcome. 

Some physical systems such as those governed by quantum mechanics are in 
principle random. However, obtaining random numbers from these systems requires 
very sensitive detectors and delicate systems. Some physical systems involve thermal 
noise. Heat is, after all, energy in the form of kinetic energy of atoms, molecules, 
and their parts. Electronic systems are often subject to thermal noise that gives a 
background hiss as occurs on AM radio, for example. 

Computers are precise systems, so give precise deterministic outcomes. However, 
we often want to do computations with random numbers. What we create are not truly 
random numbers, but rather pseudo-random numbers. Pseudo-random numbers are 
generated deterministically and yet appear to have many properties of truly random 
numbers. 
Rather than describe ways of creating sequences of pseudo-random numbers by 
means of functions, we use the concept of generators. A generator is like a function 
except that it returns a potentially infinite sequence of values. For the pseudo-code 
in this book, we use the keyword yield to return the new value in the sequence. 
However, the generator can continue doing computations after the yield, until the 
next yield statement when it returns the following value in the sequence. A generator 
will continue performing operations as needed until it terminates. A generator 
will terminate when it meets a return statement. A generator is an example of a 
co-routine or task [247]. Generators are supported explicitly in a number of modern 
programming languages, such as Python, and can be implemented conveniently in 
other languages, such as Julia. 

To illustrate how generators work, Algorithm 64 shows an example of a generator 
and how to use it. This example would print 1, 2, 3, ... in order and without ending. 
In the pseudo-code here, we use either “for x in g ...” to enumerate the values 
generated by a generator g, or “next(g)” to access the next value yielded by the 
generator. 

Pseudo-random number sequences are most naturally implemented as generators 
as described above. Algorithms that use pseudo-random sequences are also often 
best implemented in terms of generators. 
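
Since Python supports generators directly, the generator of Algorithm 64 can be written almost verbatim. The following is a minimal sketch; the helper `first_values` is an illustrative addition that truncates the otherwise endless sequence so that it can be inspected.

```python
from itertools import islice


def mylist(n):
    """Yield n, n + 1, n + 2, ... without end (the generator of Algorithm 64)."""
    while True:
        yield n
        n += 1


def first_values(gen, count):
    """Take the first `count` values from a (possibly infinite) generator."""
    return list(islice(gen, count))


g = mylist(1)
a = next(g)   # first value yielded
b = next(g)   # the generator resumes after its yield and produces the next value
```

Both access styles from the pseudo-code carry over: `for x in mylist(1): ...` enumerates values, and `next(g)` pulls one value at a time.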


7.2.1 The Arithmetical Generation of Random Digits 


The generation of pseudo-random numbers often involves number theory, which is 
used to analyze their behavior. John von Neumann’s favored method was the so- 
called “middle-square” method [255]: take an n-digit integer x with n even. We pad 
x on the left with zeros if necessary. Compute y = x? and then take the middle n 
digits as the next item in the pseudo-random sequence. Unfortunately, this method 
often ends up in short cycles or becomes constant. 
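
The middle-square method is easy to express as a generator. This is a sketch under the description above, with the number of digits defaulting to four and the seeds chosen purely for illustration.

```python
from itertools import islice


def middle_square(seed: int, n_digits: int = 4):
    """Middle-square generator: square x, pad to 2n digits, keep the middle n."""
    if n_digits % 2 != 0:
        raise ValueError("n_digits must be even")
    x = seed
    lo = n_digits // 2
    while True:
        squared = str(x * x).zfill(2 * n_digits)  # pad on the left with zeros
        x = int(squared[lo:lo + n_digits])        # middle n digits
        yield x


first = list(islice(middle_square(1234), 3))
stuck = list(islice(middle_square(2500), 3))
```

Starting from 1234 the first values are 5227, 3215, 3362. Starting from 2500 the sequence is constant, because 2500 squared is 06250000 and its middle four digits are again 2500, which illustrates the weakness noted above.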

An approach that creates better behaved pseudo-random generators is to use linear 
congruential generators: 


(7.2.1)    $x_{k+1} \leftarrow m\,x_k + b \pmod{n}$. 


Note that $a \pmod{n}$ is taken to mean the remainder when $a$ is divided by $n$: $a 
\pmod{n} = r$ means that $a = q\,n + r$ with $q$ an integer and $0 \le r < n$. If $a = q\,b$ 
for integers $a$, $b$ and $q$, we say that $b$ divides $a$, denoted $b \mid a$. 
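
A linear congruential generator following (7.2.1) takes only a few lines as a Python generator. The parameter values used here (m = 5, b = 3, n = 16) are small illustrative choices for checking by hand and are far too small for practical use.

```python
from itertools import islice


def lcg(m: int, b: int, n: int, seed: int):
    """Yield x_1, x_2, ... with x_{k+1} = (m * x_k + b) mod n and x_0 = seed."""
    x = seed % n
    while True:
        x = (m * x + b) % n   # Python's % returns a remainder in [0, n) for n > 0
        yield x


values = list(islice(lcg(5, 3, 16, 0), 5))
```

With these parameters the sequence begins 3, 2, 13, 4, 7.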

Analyzing these pseudo-random generators involves basic number theory [216]. 
Since there are only finitely many possible values of $x_k$ ($0 \le x_k < n$ for all $k$), there 
must eventually be a cycle of values $x_k = x_{k+\tau}$. The cycle length $\tau > 0$ should be 
much larger than the number of samples taken. Modern pseudo-random number 
generators should have extremely long cycle lengths. 

To carry out the analysis, we need to introduce some concepts from number theory: 
the greatest common divisor (gcd) of two integers $x$ and $y$, denoted $d = \gcd(x, y)$, 
which is the largest positive integer $d$ that divides both $x$ and $y$, or zero if $x = 
y = 0$. This can be computed by the Euclidean algorithm or the extended Euclidean 
algorithm. The extended Euclidean algorithm is shown as Algorithm 65, which not 
only computes $d = \gcd(x, y)$ but also integers $r$ and $s$ where $d = r\,x + s\,y$. Two 
non-zero integers $x$ and $y$ are relatively prime if $\gcd(x, y) = 1$. 

Algorithm 65 Extended Euclidean algorithm (recursive version) 

    function exteuclid(x, y) 
        if y = 0: return (|x|, sign x, 0); end if 
        q ← ⌊x/y⌋; r ← x − q y 
        (d, u, v) ← exteuclid(y, r) 
        return (d, v, u − q v) 
    end function 
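
Algorithm 65 translates directly into Python. This sketch relies on Python's floor division `//`, which matches the floor in the pseudo-code for negative arguments as well; the returned pair of coefficients satisfies d = u x + v y.

```python
def exteuclid(x: int, y: int):
    """Return (d, u, v) with d = gcd(x, y) >= 0 and d == u * x + v * y."""
    if y == 0:
        sign = (x > 0) - (x < 0)      # sign of x: -1, 0 or +1
        return abs(x), sign, 0
    q = x // y                        # floor of x / y
    r = x - q * y
    d, u, v = exteuclid(y, r)         # d == u * y + v * r
    return d, v, u - q * v            # substitute r = x - q * y


d, u, v = exteuclid(240, 46)
```

For the pair (240, 46) the gcd is 2, and the returned coefficients give 240 u + 46 v = 2.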

Lagrange’s theorem in group theory implies that if $a$ and $n$ are relatively prime, 
then $a^{\phi(n)} \equiv 1 \pmod{n}$ where $\phi(n)$ is the number of elements of the set 

    $\{x \in \mathbb{Z} \mid \gcd(x, n) = 1 \;\&\; 0 < x < n\}$. 

The function $\phi(n)$ here is called the Euler totient function. If $n = p_1^{r_1} p_2^{r_2} \cdots p_\ell^{r_\ell}$ where 
each $p_j$ is a prime number and $r_j \ge 1$, then 

(7.2.2)    $\phi(n) = \prod_{j=1}^{\ell} p_j^{r_j - 1}(p_j - 1)$. 
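
Formula (7.2.2) can be evaluated by trial-division factorization, which is adequate for small n. The sketch below assumes n is at least 2; the function name `totient` is an illustrative choice.

```python
from math import gcd


def totient(n: int) -> int:
    """Euler totient via (7.2.2): product of p^(r-1) * (p - 1) over prime powers of n."""
    if n < 2:
        raise ValueError("this sketch assumes n >= 2")
    result, m, p = 1, n, 2
    while p * p <= m:
        if m % p == 0:
            r = 0
            while m % p == 0:       # strip out the full power p^r
                m //= p
                r += 1
            result *= p ** (r - 1) * (p - 1)
        p += 1
    if m > 1:                       # one prime factor larger than sqrt(n) may remain
        result *= m - 1
    return result
```

For example, `totient(12)` is 4 and `totient(16)` is 8, and `pow(a, totient(n), n)` is 1 whenever `a` and `n` are relatively prime.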


We can start our analysis of (7.2.1) by noting that if $\gcd(m, n) = 1$ and $b = 0$ then 
$x_k = m^k x_0 \pmod{n}$. If $\gcd(x_0, n) = 1$ and $x_{k+\tau} = x_k$ then $m^\tau \equiv 1 \pmod{n}$, which 
is true if $\phi(n)$ divides $\tau$. 

For most modern computers, it is convenient for $n$ to be a power of two: $n = 2^r$. 
Then $\phi(n) = 2^{r-1}(2 - 1) = 2^{r-1}$. The period of the iteration $x_{k+1} = m\,x_k 
\pmod{n}$ is then no more than $\phi(n) = \phi(2^r) = 2^{r-1} = n/2$, and even this is out of reach unless $m$ and 
$x_0$ are odd. To get a full period of length $n$ we need $b \not\equiv 0 \pmod{n}$. In fact, the 
period can then be $n$, which is the maximum possible, of course. Indeed, this is desirable 
as then any $x_0 \in \{0, 1, 2, \ldots, n - 1\}$ is in the cycle of length $n$. 

The conditions for having cycle length $n$ are given by Hull and Dobell [129]: 


Theorem 7.5 The iteration (7.2.1) has period $n$ provided 

• $\gcd(b, n) = 1$; 
• $m \equiv 1 \pmod{p}$ if $p$ is a prime divisor of $n$; and 
• $m \equiv 1 \pmod{4}$ if 4 is a factor of $n$. 


The proof is an application of basic number theory. If $n \ge 4$ is a power of 2 then this 
reduces to $b$ being odd and $m \equiv 1 \pmod{4}$. 

Having period $n$ for (7.2.1) has one very practical consequence: no matter what 
$x_0$ (the seed) is chosen to be, each possible value in $\{0, 1, 2, \ldots, n - 1\}$ occurs 
exactly once per cycle, so the fraction of iterates equal to any given value is $1/n$. 
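
The conditions of Theorem 7.5 can be tested against a brute-force computation of the period for small moduli. The following is a sketch; the function names are illustrative, and the period is found by simply iterating until a state repeats, which is only practical for small n.

```python
from math import gcd


def prime_divisors(n: int) -> set:
    """Set of prime divisors of n >= 2, by trial division."""
    primes, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            primes.add(p)
            n //= p
        p += 1
    if n > 1:
        primes.add(n)
    return primes


def hull_dobell(m: int, b: int, n: int) -> bool:
    """Check the three conditions of Theorem 7.5."""
    if gcd(b, n) != 1:
        return False
    if any((m - 1) % p != 0 for p in prime_divisors(n)):
        return False
    if n % 4 == 0 and (m - 1) % 4 != 0:
        return False
    return True


def lcg_period(m: int, b: int, n: int, seed: int = 0) -> int:
    """Length of the cycle that x_{k+1} = (m x_k + b) mod n eventually enters."""
    first_seen = {}
    x, k = seed % n, 0
    while x not in first_seen:
        first_seen[x] = k
        x = (m * x + b) % n
        k += 1
    return k - first_seen[x]
```

With n = 16, the choice m = 5, b = 3 satisfies all three conditions and has period 16. The choice m = 3, b = 1 has both m and b odd but fails the condition modulo 4, and its period from seed 0 is only 8.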
Fig. 7.2.1 Planes evident for a subsequence of a linear congruential generator 


This is not sufficient for the development of practical pseudo-random number 
generators. For one thing, because $x_{k+1}$ is a function of $x_k$, these are not independent. 
We would like them, however, to appear to be nearly independent. Quantities such as 
correlations should be close to zero. If $r_k = x_k/n$ and $n$ is large, then $r_k$ will appear to 
be approximately uniformly distributed (in the long run) in the range $[0, 1)$. If $r_{k+1} = 
m\,r_k + b' \pmod{1}$ with $b' = b/n$, then $E[r_{k+1} r_k] = E[r_{k+1}]\,E[r_k] + \frac{1}{12m}[1 - 
6b'(1 - b')]$. This differs from what is expected for independent $r_{k+1}$ and $r_k$ by 
$O(1/m)$. So large $m$ gives a correlation that is close to zero. 

Linear congruential pseudo-random number generators are, however, not much used now for pseudo-random number generation as they have other weaknesses. Knuth [145] analyzed linear congruential generators from the point of view of cryptography. He showed that if $n$ is a power of two, it is possible to determine $m$ and $b$ from just the leading digits $\lfloor x_k / 2^s \rfloor$, using a relatively small number of elements of the sequence $\lfloor x_k / 2^s \rfloor$, $k = 0, 1, 2, \ldots$. Of course, the issues involving cryptographic use of pseudo-random generators are different from the numerical issues, but some of the same questions arise.

One of the weaknesses that linear congruential generators have is exposed by looking to higher dimensions. Entacher [85] gives examples of high period linear congruential generators that have serious weaknesses. One example is $n = 2^{32}$, $m = 2\,396\,548\,189$, and $b = 0$ with $x_0 = 1$. This method appears to be a serious pseudo-random number generator. But iterates of a subsequence $(x_{jr}, x_{(j+1)r}, x_{(j+2)r})/n$ with $r = 105$ are plotted in Figure 7.2.1 as points in $\mathbb{R}^3$, showing that this subsequence clearly falls into planes.

This illustrates a related result of Marsaglia [171]: 


Theorem 7.6 If $b = 0$ then the points $(x_k, x_{k+1}, \ldots, x_{k+d-1}) \in \{0, 1, 2, \ldots, n-1\}^d$ for $k = 0, 1, 2, \ldots$ generated by (7.2.1) lie within no more than $(d!\,n)^{1/d}$ hyperplanes of the form $c_0 x_k + c_1 x_{k+1} + \cdots + c_{d-1} x_{k+d-1} \in \mathbb{Z}$.
Some linear congruential generators can result in the points $(x_k, x_{k+1}, \ldots, x_{k+d-1})$ lying in far fewer hyperplanes. These kinds of defects can be detected through spectral tests: Let $\mathbf{j} = (j_1, j_2, \ldots, j_d) \in \mathbb{Z}^d$ with $0 \leq j_\ell < r$ for $\ell = 1, 2, \ldots, d$, for some divisor $r$ of $n$. Create a histogram over the set $\{0, 1, 2, \ldots, n-1\}^d$ for a partition into hypercubes

$$A_{\mathbf{j}} = \{\, i_1 \mid j_1 (n/r) \leq i_1 < (j_1 + 1)(n/r) \,\} \times \cdots \times \{\, i_d \mid j_d (n/r) \leq i_d < (j_d + 1)(n/r) \,\}.$$

We let $h_{\mathbf{j}}$ be the number of points $(x_k, x_{k+1}, \ldots, x_{k+d-1})$ in $A_{\mathbf{j}}$. Taking the discrete Fourier transform of the $d$-dimensional array $h_{\mathbf{j}}$ gives information about whether the points are restricted to a set of parallel planes.

Another property of pseudo-random number generators that is greatly desired is the lack of long-range correlation. That is, we want $x_k$ and $x_{k+p}$ to appear to be independent for $k = 0, 1, 2, \ldots$ for fairly large $p$. Part of this is wanting to have long-period generators ($x_k \neq x_{k+p}$) but might include other conditions such as $x_k \neq n - x_{k+p}$ for modest $p$. Because of these conditions, designing a good pseudo-random number generator can be a difficult task. Most modern pseudo-random number generators are not linear congruential generators. We will look at some.


7.2.2 Modern Pseudo-Random Number Generators 


In this section, we will look at a few modern pseudo-random number generators. All of them are essentially linear in that they can be represented fairly compactly as having a state vector update of the form $s_{k+1} \leftarrow A s_k \pmod{n}$. The output of these methods is often tempered in the sense that the actual output is not $s_k$ but rather $\mathrm{trunc}(T s_k \pmod{n})$ where $T$ is an invertible matrix modulo $n$, and $\mathrm{trunc}(z)$ extracts leading bits from $z$.


7.2.2.1 Mersenne Twister 


Mersenne Twister [174] is a generator developed to overcome the limitations of linear congruential generators; its period is a Mersenne prime: $2^p - 1$ where $p$ is itself prime. The specific method described uses $p = 19\,937$ giving an extremely long period. The basic idea can be described in terms of polynomials over $\mathbb{Z}_2 = \{0, 1\}$. Arithmetic in $\mathbb{Z}_2$ can be implemented very easily in modern computer hardware: addition in $\mathbb{Z}_2$ is implemented as exclusive or, while multiplication is simply “and”-ing the two values.

To describe this method, let $\mathbb{Z}_2[t]$ be the set of polynomials in $t$ with coefficients in $\mathbb{Z}_2$. We can do computations in $\mathbb{Z}_2[t]$ modulo $f(t)$ for a polynomial $f(t)$ using synthetic division. If $f(t)$ is irreducible (that is, cannot be written as $f(t) = g(t)\,h(t)$ with non-constant polynomials $g(t)$ and $h(t)$), then $\mathbb{Z}_2[t]/f(t)$ (the polynomials in $\mathbb{Z}_2[t]$ reduced modulo $f(t)$) forms a field; that is, every polynomial in $\mathbb{Z}_2[t]$ that is non-zero modulo $f(t)$ has an inverse modulo $f(t)$. The number of elements of this field is $2^{\deg f}$. So if $f(t)$ has degree $p$, then the number of non-zero elements of $\mathbb{Z}_2[t]/f(t)$ is $2^p - 1$. The non-zero elements of a field form a group under multiplication, and so the order of any non-zero element of $\mathbb{Z}_2[t]/f(t)$ under multiplication must divide $2^p - 1$. As $p$ was chosen to make $2^p - 1$ a prime, this means that there are only two possible orders: one and $2^p - 1$. The only element of $\mathbb{Z}_2[t]/f(t)$ with order one under multiplication is the constant polynomial 1. All other non-zero elements of $\mathbb{Z}_2[t]/f(t)$ have order $2^p - 1$.
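The order argument can be tried out on a small case. The sketch below stores a polynomial over $\mathbb{Z}_2$ as a Python integer (bit $j$ is the coefficient of $t^j$), an encoding assumed here for illustration. It uses $f(t) = t^5 + t^2 + 1$, which is irreducible, so that $2^5 - 1 = 31$ is prime and every element other than 0 and 1 should have order 31.

```python
def gf2_mulmod(a, b, f):
    """Product of polynomials a and b over Z_2, reduced modulo f.

    Polynomials are ints: bit j holds the coefficient of t**j.
    The argument a must already be reduced (degree less than deg f).
    """
    deg_f = f.bit_length() - 1
    result = 0
    while b:
        if b & 1:
            result ^= a            # addition in Z_2[t] is exclusive or
        b >>= 1
        a <<= 1                    # multiply a by t
        if (a >> deg_f) & 1:
            a ^= f                 # subtract f to reduce the degree
    return result


def multiplicative_order(g, f):
    """Smallest k >= 1 with g**k = 1 modulo f (g non-zero, f irreducible)."""
    power, k = g, 1
    while power != 1:
        power = gf2_mulmod(power, g, f)
        k += 1
    return k


F5 = 0b100101                      # t^5 + t^2 + 1
orders = {multiplicative_order(g, F5) for g in range(2, 32)}
```

Here `orders` contains the single value 31, as the group-theoretic argument predicts.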

The hardest problem in creating a method of this type is to find a suitable irreducible polynomial $f(t) \in \mathbb{Z}_2[t]$ of the desired degree. The authors of [174] found an efficient way to test for irreducibility of polynomials in $\mathbb{Z}_2[t]$. Part of the trick used is noting that for any polynomial $g(t)$ in $\mathbb{Z}_2[t]$ we have $g(t^2) = g(t)^2$. The irreducible polynomial found is a sparse polynomial in the sense that most coefficients are zero, which is important for efficiency of the generator. Finding an irreducible polynomial in $\mathbb{Z}_2[t]$ of degree $p$ consists of generating polynomials of this degree and then testing to see if the generated polynomial is irreducible. Fortunately, irreducible polynomials of degree $p$ in $\mathbb{Z}_2[t]$ are fairly common. The number of irreducible polynomials of degree $n$ in $\mathbb{Z}_2[t]$ is given by the formula (see [217, Chap. 2]):

$$\frac{1}{n} \sum_{d \mid n} \mu(n/d)\, 2^d$$

where $\mu$ is the Möbius function: $\mu(k)$ is zero if $k$ has a non-trivial square factor, is $+1$ if $k$ is the product of an even number of distinct primes, and is $-1$ if $k$ is the product of an odd number of distinct primes. In particular, if $n = p$ is prime, then the number of irreducible polynomials of degree $p$ in $\mathbb{Z}_2[t]$ is $(2^p - 2)/p$ out of $2^p$ possible polynomials of degree $p$ in $\mathbb{Z}_2[t]$. Thus “randomly” selecting polynomials of degree $p$ has a probability of $\approx 1/p$ of selecting an irreducible polynomial. For $p = 19\,937$, finding an irreducible polynomial would still require an automated search, but it is quite feasible.
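The counting formula can be evaluated directly. This sketch (names are illustrative) implements the Möbius function by trial division and the count of irreducible polynomials of degree $n$ over $\mathbb{Z}_2$ using exact integer arithmetic.

```python
def mobius(k):
    """Moebius function: 0 if k has a square factor, else (-1)^(number of primes)."""
    result, p = 1, 2
    while p * p <= k:
        if k % p == 0:
            k //= p
            if k % p == 0:         # p*p divided the original k
                return 0
            result = -result
        p += 1
    if k > 1:                      # one remaining prime factor
        result = -result
    return result


def count_irreducible(n):
    """Number of irreducible polynomials of degree n over Z_2."""
    total = sum(mobius(n // d) * 2 ** d for d in range(1, n + 1) if n % d == 0)
    return total // n
```

For prime degree $p$ this reduces to $(2^p - 2)/p$; for example, `count_irreducible(5)` is $30/5 = 6$.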

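The actual algorithm implements an iteration on vectors in $\mathbb{Z}_2^{nw-r}$ ($nw - r = p$). We split the vectors into parts $x_k^u \in \mathbb{Z}_2^{w-r}$ and $x_k^\ell \in \mathbb{Z}_2^{r}$:

(7.2.3)
$$\begin{bmatrix} x_{k+n} \\ x_{k+n-1} \\ x_{k+n-2} \\ \vdots \\ x_{k+2} \\ x_{k+1}^u \end{bmatrix} = \begin{bmatrix} 0 & \cdots & I_w & \cdots & 0 & S \\ I_w & 0 & & & & \\ & I_w & \ddots & & & \\ & & \ddots & 0 & & \\ & & & I_w & 0 & \\ & & & & I_{w-r} & 0 \end{bmatrix} \begin{bmatrix} x_{k+n-1} \\ x_{k+n-2} \\ x_{k+n-3} \\ \vdots \\ x_{k+1} \\ x_k^u \end{bmatrix}$$

where the block $I_w$ in the first block row multiplies $x_{k+m}$, $S = A \begin{bmatrix} 0 & I_{w-r} \\ I_r & 0 \end{bmatrix}$, and

(7.2.4)
$$A = \begin{bmatrix} 0 & & & & a_{w-1} \\ 1 & 0 & & & a_{w-2} \\ & 1 & \ddots & & \vdots \\ & & \ddots & 0 & a_1 \\ & & & 1 & a_0 \end{bmatrix}.$$

Note that $x_k^u$ denotes the upper $w - r$ components of $x_k$.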
The matrix in (7.2.3) is sparse for efficiency. Its characteristic polynomial is easily computed explicitly:

$$\chi(t) = (t^n + t^m)^{w-r} (t^{n-1} + t^{m-1})^{r} + \sum_{j=0}^{r-1} a_j (t^n + t^m)^{w-r} (t^{n-1} + t^{m-1})^{r-j-1} + \sum_{j=r}^{w-1} a_j (t^n + t^m)^{w-j-1}.$$

Note that the specific version of Mersenne Twister can be represented by the vector

$$a = [a_0, a_1, \ldots, a_{w-1}]^T \in \mathbb{Z}_2^w$$

along with $w$, $m$, $n$, and $r$. In implementations, $a$ can be represented by a single bit string, or integer. For example, the method MT19937 has $w = 32$ (exploiting 32 bit architectures), $n = 624$ (the number of 32 bit “words” used), $m = 397$, $r = 31$ (one bit in $x_k^u$ for efficiency), and $a$ represented in hexadecimal by 0x9908B0DF.

Rather than return the raw bits produced by the algorithm, the output is tempered: a tempering function is applied to the raw output of the generator. The paper [174] recommends giving the output $y_k = T x_k$ with $T$ a binary matrix given in terms of bit shifts, bit masks, and additions modulo 2.
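To make the tempering step concrete, here is a sketch of the tempering transform in the form commonly used for MT19937 (shifts of 11, 7, 15, and 18 bits with masks 0x9D2C5680 and 0xEFC60000). Only the tempering of a single 32-bit word is shown, not the generator itself. Because it is built from shifts, masks, and exclusive or, the map is linear over $\mathbb{Z}_2$.

```python
MASK32 = 0xFFFFFFFF


def mt19937_temper(y):
    """Apply the MT19937-style tempering transform to one 32-bit word."""
    y &= MASK32
    y ^= y >> 11
    y ^= (y << 7) & 0x9D2C5680     # the mask also discards bits above bit 31
    y ^= (y << 15) & 0xEFC60000
    y ^= y >> 18
    return y & MASK32
```

Linearity means `mt19937_temper(a ^ b)` equals `mt19937_temper(a) ^ mt19937_temper(b)` for all 32-bit words, and each of the four steps is individually invertible, so the transform is a bijection on 32-bit words.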

The Mersenne Twister generator has been criticized for various defects. For exam- 
ple, if the initial state has many zeros, it can take many steps before the output starts 
appearing to be random. This problem can be offset by an improved initialization 
scheme [173]. A deeper criticism of the Mersenne Twister generator can be found 
in Vigna [254]. Improvements on the original Mersenne Twister which scramble the 
bits more effectively can be found in [197]; Vigna also proposed other bit-scrambling 
methods in [253] which are not subject to these criticisms. 


7.2.2.2 Permuted Congruential Generators 


Permuted congruential generators (PCG’s) [193] are a way of adapting linear congruential generators to give better output. Essentially PCG’s focus on tempering linear congruential generators by means of nonlinear functions (with respect to $\mathbb{Z}_2$) of tuples of consecutive outputs $(x_{k-r}, x_{k-r+1}, \ldots, x_{k-1}, x_k)$ of the linear congruential generator based on bit shifts and rotations. These clearly cannot change the period of the underlying generator but can avoid the problem of consecutive outputs lying on planes as noticed by Marsaglia. The tempering functions used in [193] are guaranteed to be bijections, so the period of the output cannot be any less than the period of the underlying generator.

Algorithm 66 Combining generators

generator combine_xor(g1, g2)
    while true
        yield next(g1) + next(g2) (mod 2)
    end
end


7.2.2.3 New Generators From Old 


Tempering can be used to improve the quality of output from underlying generators. 
The tempering function should be a bijection so that the period of the tempered 
output is the same as the period of the underlying generator. 

There are also methods of combining generators to increase the period of the combined generator. For example, Algorithm 66 shows one way to combine vectors of bits generated by two generators $g_1$ and $g_2$.

If $\tau_1$ and $\tau_2$ are the periods of generators $g_1$ and $g_2$, respectively, then the period of the combined generator divides, and is typically equal to, the least common multiple of $\tau_1$ and $\tau_2$: $\mathrm{lcm}(\tau_1, \tau_2) = (\tau_1 \tau_2)/\gcd(\tau_1, \tau_2)$. This would be most effective when applied to, for example, Mersenne Twister type methods of different periods but not to linear congruential generators with moduli that are powers of two. If $g_1$ and $g_2$ are uniform generators of real numbers in the interval $[0, 1]$, then returning the fractional part of $\mathrm{next}(g_1) + \mathrm{next}(g_2)$ would also be an effective way of combining two generators of this type.
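A toy version of Algorithm 66 shows the period behavior. In this sketch the two "generators" are just short repeating sequences built with `itertools.cycle`, chosen for illustration so that the periods 3 and 4 are coprime; `period_of` is a brute-force helper introduced here.

```python
from itertools import cycle, islice


def combine_xor(g1, g2):
    """Yield the bitwise exclusive or (addition mod 2, bit by bit) of two streams."""
    while True:
        yield next(g1) ^ next(g2)


def period_of(values):
    """Smallest shift p for which the finite window of values repeats."""
    n = len(values)
    for p in range(1, n):
        if all(values[i] == values[i + p] for i in range(n - p)):
            return p
    return n


combined = list(islice(combine_xor(cycle([1, 2, 3]), cycle([0, 4, 8, 12])), 48))
degenerate = list(islice(combine_xor(cycle([1, 2, 3]), cycle([1, 2, 3])), 12))
```

The first combination has period $\mathrm{lcm}(3, 4) = 12$. The second combines a sequence with itself and collapses to all zeros, which is why the combined period is only guaranteed to divide the least common multiple.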


7.2.3 Generating Samples from Other Distributions 


Often it is important to generate samples that have probability distributions other than a uniform distribution. However, we can use generators that produce outputs uniformly over the interval $[0, 1]$ to produce real outputs that have a specified distribution. One of the most straightforward methods is to use the inverse of the cumulative distribution function. To sample from a probability distribution with density function $p(x)$, we can use the cumulative distribution function $F(x) = \int_{-\infty}^{x} p(s)\,ds$: Suppose $U$ is uniformly distributed on $[0, 1]$, and $X = F^{-1}(U)$. Assuming that $F$ is strictly increasing,

$$\Pr[X \leq x] = \Pr[F(X) \leq F(x)] = \Pr[U \leq F(x)] = F(x).$$

Thus $X$ has the cumulative distribution function $F$, as we wanted (Figure 7.2.2).

Algorithm 67 Box-Muller method for generating normally distributed samples

generator boxmuller(U)
    while true
        u1 ← next(U); u2 ← next(U)
        r ← sqrt(−2 ln(u1))
        yield r cos(2π u2)
    end while
end generator

Fig. 7.2.2 Ziggurat algorithm

This can be applied to computing normally distributed pseudo-random variables: simply set $F(x) = \frac{1}{2}(1 + \mathrm{erf}(x/\sqrt{2}))$ where $\mathrm{erf}(x) = (2/\sqrt{\pi}) \int_0^x \exp(-s^2)\,ds$ is the error function. Given $U$ sampled from a uniform distribution we solve $F(X) = U$ for $X$. Since pseudo-random uniformly distributed values can be zero or one, which would give infinite results, we need to treat these extreme values carefully. If we use pseudo-random generators with values $k/n$, $k = 0, 1, 2, \ldots, n-1$, then the value $0/n$ should be replaced by either $1/n$ or $\frac{1}{2}/n$. The value $n/n = 1$ might also need to be replaced by $(n-1)/n = 1 - 1/n$ or $(n - \frac{1}{2})/n$.
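The inverse-CDF approach can be sketched as follows. The exponential distribution is used as a case where $F^{-1}$ has a closed form. For the normal distribution the sketch relies on `statistics.NormalDist.inv_cdf` from the Python standard library rather than solving $F(X) = U$ by hand. The clamping to $[\frac{1}{2}/n,\, 1 - \frac{1}{2}/n]$ with $n = 2^{32}$ is an assumption made here to mirror the treatment of the extreme values.

```python
import math
from statistics import NormalDist


def sample_exponential(u, rate=1.0):
    """Inverse CDF of F(x) = 1 - exp(-rate*x): F^{-1}(u) = -ln(1 - u)/rate."""
    return -math.log1p(-u) / rate


def sample_standard_normal(u, n=2 ** 32):
    """Inverse-CDF normal sample; u = 0 and u = 1 are moved half a grid step inward."""
    u = min(max(u, 0.5 / n), 1.0 - 0.5 / n)
    return NormalDist().inv_cdf(u)
```

Feeding either function with values from a uniform generator on $[0, 1)$ then gives samples from the corresponding distribution.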

An alternative method for creating normally distributed pseudo-random samples from uniformly distributed samples is the Box-Muller method [27], as shown in Algorithm 67. Note that $U$ is a uniform generator providing samples uniformly distributed over $[0, 1]$. Care should be taken to ensure that $U$ does not generate exactly zero. This can sometimes be achieved by replacing $U$ by $1 - U$ if $U$ itself can exactly generate zero.
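Algorithm 67 translates almost line for line into Python. In this sketch `uniform` is assumed to be a callable returning values in $[0, 1)$, such as `random.random`, and the $1 - U$ replacement is applied to the first uniform sample so that the logarithm is always finite.

```python
import math
import random


def box_muller_value(u1, u2):
    """Box-Muller transform of two uniform values, with u1 in (0, 1]."""
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2)


def boxmuller(uniform):
    """Generator of standard normal samples from a uniform source on [0, 1)."""
    while True:
        u1 = 1.0 - uniform()       # maps [0, 1) to (0, 1], avoiding log(0)
        u2 = uniform()
        yield box_muller_value(u1, u2)


rng = random.Random(12345)         # seeded for reproducibility
normals = boxmuller(rng.random)
first_three = [next(normals) for _ in range(3)]
```

Each output consumes two uniform samples, as in the pseudocode.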

An alternative approach is the ziggurat algorithm [172]. 

The ziggurat algorithm assumes the probability density function $p(x)$ is a monotone decreasing function. Assuming $p$ is zero outside $[x_0, \infty)$, we subdivide the interval into pieces $[x_k, x_{k+1})$, $k = 0, 1, 2, \ldots, n-2$, where $\int_{x_k}^{x_{k+1}} p(x)\,dx = 1/n$. There is one infinite interval $[x_{n-1}, \infty)$ where $\int_{x_{n-1}}^{\infty} p(x)\,dx = 1/n$. These $x_k$’s are pre-computed and passed to the ziggurat algorithm.

Algorithm 68 Ziggurat algorithm

generator ziggurat(U, p, x, fallback)
    while true
        u ← n next(U); k ← ⌊u⌋
        if u ≥ n − 1
            yield fallback(u/n)
        else
            x ← x_k + (x_{k+1} − x_k)(u − k)
            y ← p(x_k) next(U)
            if y ≤ p(x): yield x; end if
        end if
    end while
end generator

The ziggurat algorithm is shown as Algorithm 68. In the algorithm $U$ generates samples from the uniform distribution Uniform(0, 1). The basic idea is to first select one of the intervals $[x_k, x_{k+1})$, which are equally probable. Once this is done, we create a point $(x, y)$ uniformly distributed over the rectangle $[x_k, x_{k+1}) \times [0, p(x_k)]$. If $(x, y)$ is under the graph of $p(x)$ then the point $(x, y)$ is accepted and $x$ returned. Otherwise, the sample is rejected. Rejection is relatively rare if the intervals $[x_k, x_{k+1})$ are small as the rejection probability is the area of the rectangle $[x_k, x_{k+1}) \times [0, p(x_k)]$ that is not under the curve $y = p(x)$ divided by the area of $[x_k, x_{k+1}) \times [0, p(x_k)]$. This means that with narrow rectangles on average little more than two samples of a uniform distribution are needed to generate a new sample of the given distribution. There is still the necessity of handling the unbounded interval $[x_{n-1}, \infty)$ using an alternative (or fallback) function, when this is needed. In these relatively infrequent cases, the inverse of the cumulative distribution function can be used.
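The following sketch applies the idea of Algorithm 68 to the exponential density $p(x) = e^{-x}$, for which the breakpoints $x_k = -\ln(1 - k/n)$ and the inverse CDF are available in closed form. It makes two choices that are assumptions of this sketch rather than part of the algorithm as stated: after a rejection it retries within the same interval (so each interval keeps its probability $1/n$ exactly), and it draws a fresh uniform value for the position in the interval instead of reusing $u - k$.

```python
import math
import random


def exponential_breakpoints(n):
    """x_0 < x_1 < ... < x_{n-1}; each piece carries probability 1/n under exp(-x)."""
    return [-math.log(1.0 - k / n) for k in range(n)]


def ziggurat_exponential(uniform, n=64):
    """Generator of Exponential(1) samples; uniform() returns values in [0, 1)."""
    x = exponential_breakpoints(n)
    while True:
        u = n * uniform()
        k = min(int(u), n - 1)
        if k == n - 1:
            # Unbounded tail: inverse CDF at u/n, which lies in [1 - 1/n, 1).
            yield -math.log(1.0 - u / n)
            continue
        while True:
            # Rejection sampling inside the strip [x_k, x_{k+1}) under height p(x_k).
            t = x[k] + (x[k + 1] - x[k]) * uniform()
            y = math.exp(-x[k]) * uniform()
            if y <= math.exp(-t):
                yield t
                break
```

With $n = 64$ the acceptance rate in the bounded strips is high, so most samples cost three uniform values in this variant.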


7.2.4 Parallel Generators 


In many applications, such as for Monte Carlo methods, it is desirable to have multiple 
processors running pseudo-random number generators. These generators should be 
statistically independent, or at least appear to be independent. In a parallel computing 
environment, the same generator is typically used by each processor. Since pseudo- 
random generators are deterministic processes, we need to initialize each copy of the 
generator differently so as to at least avoid overlap in the generated sequences. 
There is a way of doing this efficiently for large period linear generators. For example, the Mersenne Twister generator (see Section 7.2.2.1) has period $2^p - 1$ where this is a Mersenne prime, with the value $p = 19937$. We do not expect even very long running Monte Carlo methods to use more than, say, $2^{100} \approx 1.27 \times 10^{30}$ samples. (Avogadro’s number, $\approx 6.0 \times 10^{23}$, is the number of hydrogen atoms in one gram of hydrogen. This is a rough upper bound to the number of objects that any current or future computer memory will be able to hold.) So if we ensure that the generators are initialized to have this spacing (or more) of the generated sequences, then there is unlikely to be any overlap between the sequences. It would also be wise to avoid spacing that is a divisor of the period of the generator.

Algorithm 69 Matrix powers via repeated squaring

function mpower(A, s)
    if s = 1: return A; end if
    if s even
        return mpower(A², s/2)
    else
        return A mpower(A², (s − 1)/2)
    end if
end function

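Of course, with spacing $s$ we could, in principle, run the generator with the standard initialization $k\,s$ times to prepare the generator for processor $k$. If $s$ is as large as $2^{100}$ this would take unacceptably long. However, for linear generators, this can be done efficiently by repeated squaring. Consider first linear congruential generators (7.2.1)

$$x_{k+1} \leftarrow m x_k + b \pmod{n}.$$

This can be represented as a linear update of the state:

$$\begin{bmatrix} x_{k+1} \\ 1 \end{bmatrix} = \begin{bmatrix} m & b \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_k \\ 1 \end{bmatrix} \pmod{n}, \qquad \text{so that} \qquad \begin{bmatrix} x_{k+s} \\ 1 \end{bmatrix} = \begin{bmatrix} m & b \\ 0 & 1 \end{bmatrix}^{s} \begin{bmatrix} x_k \\ 1 \end{bmatrix} \pmod{n}.$$

We can use repeated squaring to compute $A^s$ for a matrix $A$ using $O(\log s)$ matrix multiplications, as shown in Algorithm 69.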
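For a linear congruential generator the jump-ahead can be sketched with $2 \times 2$ integer matrices reduced modulo $n$. The function names and the example parameters ($m = 5$, $b = 3$, $n = 2^{16}$) are illustrative choices, not taken from the text.

```python
def mat_mul_mod(a, b, n):
    """Product of two 2x2 integer matrices with entries reduced modulo n."""
    return [[(a[i][0] * b[0][j] + a[i][1] * b[1][j]) % n for j in range(2)]
            for i in range(2)]


def mpower_mod(a, s, n):
    """a**s modulo n for s >= 1 by repeated squaring, following Algorithm 69."""
    if s == 1:
        return [[entry % n for entry in row] for row in a]
    squared_power = mpower_mod(mat_mul_mod(a, a, n), s // 2, n)
    return squared_power if s % 2 == 0 else mat_mul_mod(a, squared_power, n)


def lcg_jump(x, s, m, b, n):
    """State reached from x after s steps of x <- m*x + b (mod n)."""
    power = mpower_mod([[m, b], [0, 1]], s, n)
    return (power[0][0] * x + power[0][1]) % n


# Seed for processor k with spacing s: jump k*s steps from a common seed.
seed_for_processor_3 = lcg_jump(7, 3 * 2 ** 100, 5, 3, 2 ** 16)
```

The recursion depth is about $\log_2 s$, so even $s = 2^{100}$ needs only on the order of a hundred small matrix products.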

This same idea can be applied to the Mersenne Twister generator, although the matrix is a $p \times p$ binary matrix, and for $p = 19937$ each matrix multiplication can be expensive. An alternative is to combine several Mersenne Twister generators with Mersenne prime periods $2^{p_1} - 1, 2^{p_2} - 1, \ldots, 2^{p_r} - 1$ and with smaller distinct values for $p_1, p_2, \ldots, p_r$, combining the outputs using Algorithm 66 for example. The period of the combined generator would be $\prod_{j=1}^{r} (2^{p_j} - 1) \approx 2^q$ where $q = \sum_{j=1}^{r} p_j$. Initializing a generator to start at $x_s$ would then require multiplying $r$ matrices of sizes $p_j \times p_j$ ($j = 1, 2, \ldots, r$) $O(\log s)$ times for a cost of $O((\log s) \sum_{j=1}^{r} p_j^3)$ time and $O((\log s) \sum_{j=1}^{r} p_j^2)$ memory.


We can then compute 
Exercises.


(1) Consider a linear congruential pseudo-random number generator: $x_{k+1} = m x_k + b \pmod{n}$. Let $d = \gcd(m, n)$. Show that $x_{k+1} \equiv b \pmod{d}$ for all $k$.

(2) Show that if $\gcd(b, n) = 1$ and $\gcd(m - 1, n) = 1$, then the generator $x_{k+1} = m x_k + b \pmod{n}$ has the same period as $y_{k+1} = m y_k \pmod{n}$.

(3) Most pseudo-random number generators do not reveal their entire state. For example, a linear congruential pseudo-random number generator working modulo a power of two such as $2^{64}$ will usually not return $x_k$, but only the most significant 32 bits of $x_k$. Explain why returning the least significant 32 bits of $x_k$ is not helpful.

(4) Consider the vector linear congruential pseudo-random number generator: $x_{k+1} = M x_k + b \pmod{n}$. If $n$ is prime (so that the integers modulo $n$ form a field) and $M - I$ is invertible modulo $n$, show that $y_k = x_k + (M - I)^{-1} b \pmod{n}$ satisfies $y_{k+1} = M y_k \pmod{n}$. If $z_k = T y_k$ with $T$ invertible modulo $n$, show that $z_{k+1} = T M T^{-1} z_k \pmod{n}$.

(5) Show that if $f(x)$ is a polynomial in $x$ with integer coefficients, then $f(x)^2 \equiv f(x^2) \pmod{2}$.

(6) Many $n \times n$ integer matrices are invertible modulo 2. We can more precisely estimate the probability of generating an invertible $n \times n$ integer matrix $A$ modulo 2, with each entry being chosen as either zero or one with probability $\frac{1}{2}$ independently. Show that the probability that $A$ is invertible modulo 2 is $p_n = \prod_{j=1}^{n} (1 - 2^{-j})$. [Hint: Let $A = [a_1, a_2, \ldots, a_n]$ be the matrix generated as indicated above. The column $a_1$ must be a non-zero vector: there are $2^n - 1$ of these. For $A$ to be invertible, $a_2$ must not be a multiple of $a_1$; once $a_1$ is determined, there are $2^n - 2^1$ such vectors. Then $a_3$ must not be a linear combination of $a_1$ and $a_2$; once $a_1$ and $a_2$ are determined, there are $2^n - 2^2$ such vectors. Repeating this argument, there are $\prod_{j=0}^{n-1} (2^n - 2^j)$ such choices of $A$, out of $2^{n^2}$ possible binary $n \times n$ matrices.] Generate 1000 binary $5 \times 5$, $10 \times 10$, and $20 \times 20$ matrices pseudo-randomly; record the number of these generated matrices that are invertible modulo 2. How do the empirical probabilities of generating invertible binary matrices modulo 2 relate to the predicted value $p_n$? What is $\lim_{n \to \infty} p_n$ to six digits?


7.3 Statistics 


Statistics are ways of studying probability distributions through sets of samples (usu- 
ally independent) taken from that distribution. Statistics include averages of different 
kinds, measures of variation about an average, and other measures of properties of 
that distribution. 
Given a sample $\{x_1, x_2, \ldots, x_N\}$, we have an empirical probability distribution given by

$$\Pr[X_{\text{sample}} \in E] = N^{-1}\, |\{\, i \mid x_i \in E \,\}|;$$

that is, each $x$ has probability equal to the number of times $x$ appears in the sample (repetitions allowed) divided by the number of elements in the sample. Empirical distributions are naturally discrete, but can be used to represent or approximate continuous distributions in the sense that for bounded continuous functions $f$, $N^{-1} \sum_{i=1}^{N} f(x_i) \to E[f(X)]$ as $N \to \infty$ where $X$ is distributed according to the underlying probability distribution.

Often probability density functions are parameterized $p(x; \theta)$ with parameter vector $\theta$. Any method for estimating $\theta$ from a sample is called a statistic. Since samples are random variables, an estimate $\hat\theta$ computed from the sample must itself be a random variable. The estimate is unbiased if $E[\hat\theta] = \theta$. The estimate is asymptotically unbiased if the estimate $\hat\theta_N$ for sample size $N$ satisfies $E[\hat\theta_N] \to \theta$ as $N \to \infty$. One class of estimators are maximum likelihood estimators (MLE’s). The MLE estimator $\hat\theta$ maximizes the likelihood $\prod_{i=1}^{N} p(x_i; \theta)$ over $\theta$ for the sample $\{x_1, x_2, \ldots, x_N\}$.

Often it is easier to analyze the logarithm of the likelihood: $\sum_{i=1}^{N} \ln p(x_i; \theta)$.


7.3.1 Averages and Variances 


The word “average” is often used as a synonym for “mean” in the sense of (7.3.1). 
But “average” can also refer to a median or a mode, also described here. 

The mean of a real- or vector-valued random variable $X$ is the expectation $E[X]$. The mean of a sample $\{x_1, x_2, \ldots, x_N\}$ is

(7.3.1)    $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i.$

The sample mean is the expectation of the random variable $X_{\text{sample}}$.

The median of a real-valued random variable is the value $m$ where $\Pr[X \leq m] = \frac{1}{2}$; if there is no such $m$ because the probability $\Pr[X \leq m]$ jumps over $\frac{1}{2}$, then we usually take $m = \frac{1}{2}(m_1 + m_2)$ where $\Pr[X \leq m_2] \geq \frac{1}{2}$, $\Pr[X \geq m_1] \geq \frac{1}{2}$ with $m_1 \leq m_2$. The median of a random variable $X$ is also the minimizer of $E[|X - s|]$ over $s$. The median of a set of real values $\{x_1, x_2, \ldots, x_N\}$ is the median of the random variable $X_{\text{sample}}$; its value is $m$ where $|\{\, i \mid x_i < m \,\}| = |\{\, i \mid x_i > m \,\}|$. If the sample is sorted so that $x_1 \leq x_2 \leq \cdots \leq x_N$, then the median is $x_{(N+1)/2}$ for $N$ odd and $\frac{1}{2}(x_{N/2} + x_{N/2+1})$ for $N$ even.

The mode of a discrete-valued random variable is the value $x$ that maximizes $\Pr[X = x]$. For a continuous random variable, the mode maximizes the probability density function. The mode of a sample is not so easily defined in general.

Both the mean and the median of a real-valued random variable, and of a set of samples, can be determined through optimization problems:

mean minimizes $E[(X - m)^2]$ over $m$,

median minimizes $E[|X - m|]$ over $m$.


The variance of a random variable $X$ is $\mathrm{Var}[X] = E[(X - \mu)^2]$ where $\mu = E[X]$ is the mean of $X$. Usually the variance is written as $\sigma^2$ where $\sigma \geq 0$ is the standard deviation of $X$. If $X$ is a physical quantity, then $\mu$ and $\sigma$ both have the same physical units as $X$, but the variance has the units of $X^2$.

A commonly used alternative formula for the variance is

$$\mathrm{Var}[X] = E[X^2] - E[X]^2 = E[X^2] - \mu^2.$$


It is tempting to estimate $\mathrm{Var}[X]$ with $\hat\sigma^2 = \mathrm{Var}[X_{\text{sample}}] = (1/N) \sum_{i=1}^{N} (x_i - \hat\mu)^2$ and $\hat\mu = (1/N) \sum_{i=1}^{N} x_i$. But this is actually a biased estimate of the variance. An unbiased estimate is

(7.3.2)    $\hat\sigma^2_{\text{un}} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \hat\mu)^2.$

On the other hand, the (biased) maximum likelihood estimator for the variance using the normal probability distribution $p(x; \mu, \sigma) = (2\pi\sigma^2)^{-1/2} \exp(-(x - \mu)^2/(2\sigma^2))$ is (see Exercise 2)

(7.3.3)    $\hat\sigma^2_{\text{MLE}} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat\mu)^2.$

The maximum likelihood estimator of the variance for normally distributed samples is still asymptotically unbiased.
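The bias can be seen exactly on a tiny example. The sketch below computes both estimates for a sample and then averages them over all four equally likely samples of size two from a fair coin taking values 0 and 1, whose true variance is $1/4$. The helper name is illustrative.

```python
from itertools import product
from statistics import mean


def variance_estimates(sample):
    """Return (biased 1/N estimate, unbiased 1/(N-1) estimate) of the variance."""
    mu_hat = mean(sample)
    sum_sq = sum((x - mu_hat) ** 2 for x in sample)
    n = len(sample)
    return sum_sq / n, sum_sq / (n - 1)


# Exact expectations over all equally likely samples of size 2 from {0, 1}.
all_samples = list(product([0, 1], repeat=2))
avg_biased = mean(variance_estimates(s)[0] for s in all_samples)
avg_unbiased = mean(variance_estimates(s)[1] for s in all_samples)
```

The averaged $1/N$ estimate is $1/8$, a factor $(N-1)/N = 1/2$ too small, while the averaged $1/(N-1)$ estimate equals the true variance $1/4$.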


7.3.2 Regression and Curve Fitting 


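Regression and curve fitting are also important computational tasks in statistics. A common approach is to use least squares to fit a set of data $(x_i, y_i)$, $i = 1, 2, \ldots, N$, with a function $x \mapsto \sum_{j=1}^{m} c_j \varphi_j(x)$. We assume a statistical model $y_i = \sum_{j=1}^{m} c_j \varphi_j(x_i) + \epsilon_i$ where $\epsilon_i$ follows a Normal$(0, \sigma^2)$ distribution and the $\epsilon_i$’s are mutually independent. The likelihood function is

$$L(c) = \prod_{i=1}^{N} (2\pi\sigma^2)^{-1/2} \exp(-\epsilon_i^2/(2\sigma^2)) = (2\pi\sigma^2)^{-N/2} \exp(-\epsilon^T \epsilon/(2\sigma^2)),$$

so

$$\ln L(c) = -\frac{N}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\, \epsilon^T \epsilon = -\frac{N}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} (\Phi c - y)^T (\Phi c - y)$$

where $\Phi_{ij} = \varphi_j(x_i)$. So maximizing $L(c)$ for $c$ is equivalent to minimizing $(\Phi c - y)^T (\Phi c - y)$. That is, we are minimizing the sum of the squares of the errors. This can be done using either normal equations or the QR factorization. The optimal $\hat{c}$ is given by the normal equations $\Phi^T \Phi \hat{c} = \Phi^T y$. If $c$ is the true set of coefficients, $y = \Phi c + \epsilon$ and so $\Phi^T \Phi \hat{c} = \Phi^T \Phi c + \Phi^T \epsilon$ and therefore $\hat{c} = c + (\Phi^T \Phi)^{-1} \Phi^T \epsilon$. The error in the computed coefficients $\hat{c} - c = (\Phi^T \Phi)^{-1} \Phi^T \epsilon$ is distributed according to the Normal$(0, \sigma^2 (\Phi^T \Phi)^{-1})$ distribution. This estimate is unbiased: $E[\hat{c}] = c$. We might want to estimate the variance $\sigma^2$ of the $\epsilon_i$’s by using $\hat\epsilon = y - \Phi \hat{c}$. However, this will lead to a biased estimate of $\sigma^2$:

$$\hat\epsilon = y - \Phi \hat{c} = y - \Phi (\Phi^T \Phi)^{-1} \Phi^T y = [I - \Phi (\Phi^T \Phi)^{-1} \Phi^T]\, y.$$

The matrix $P := I - \Phi (\Phi^T \Phi)^{-1} \Phi^T$ is the orthogonal projection onto the orthogonal complement of range $\Phi$. Since $y = \Phi c + \epsilon$,

$$\hat\epsilon = P(\Phi c + \epsilon) = P \epsilon$$

so $\hat\epsilon$ is in $(\mathrm{range}\, \Phi)^\perp$. We can consider $\hat\epsilon$ to be distributed according to the Normal$(0, \sigma^2 P)$ distribution as $P^T = P = P^2 = P^T P$, understood as the limit of the distribution of Normal$(0, \sigma^2 P + \alpha I)$ as $\alpha \downarrow 0$. Also

(7.3.4)    $E[\hat\epsilon^T \hat\epsilon] = E[\epsilon^T P^T P \epsilon] = E[\epsilon^T P \epsilon] = \mathrm{trace}(P)\, \sigma^2.$

For $y \in \mathbb{R}^n$ (here $n = N$ is the number of data points) and $c \in \mathbb{R}^m$, $P$ is the orthogonal projection onto $(\mathrm{range}\, \Phi)^\perp$, which has dimension $n - m$ when $\Phi$ has full column rank. Since $P$ is symmetric because it is an orthogonal projection, it has real eigenvalues that are either zero or one: $n - m$ eigenvalues are one and $m$ eigenvalues are zero. Thus $\mathrm{trace}(P) = n - m$. So $E[\hat\epsilon^T \hat\epsilon] = (n - m)\sigma^2$. An unbiased estimate of $\sigma^2$ is then $\hat\sigma^2 = \hat\epsilon^T \hat\epsilon/(n - m)$, not $\hat\epsilon^T \hat\epsilon/n$.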
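For the simplest case $m = 2$ with basis functions $1$ and $x$ (a straight-line fit), the normal equations can be solved by hand. The sketch below, with an illustrative function name, returns the fitted coefficients together with the residual sum of squares divided by $n - m = n - 2$.

```python
def fit_line(xs, ys):
    """Least squares fit y ~ c0 + c1*x via the 2x2 normal equations.

    Returns ((c0, c1), sigma2_hat) where sigma2_hat = RSS / (n - 2).
    """
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx                 # determinant of Phi^T Phi
    c0 = (sxx * sy - sx * sxy) / det
    c1 = (n * sxy - sx * sy) / det
    rss = sum((y - (c0 + c1 * x)) ** 2 for x, y in zip(xs, ys))
    return (c0, c1), rss / (n - 2)


coeffs, sigma2_hat = fit_line([0, 1, 2], [0, 1, 0])
```

For the three points above the fit is the horizontal line $y = 1/3$, the residual sum of squares is $2/3$, and dividing by $n - m = 1$ rather than $n = 3$ gives the variance estimate $2/3$. For larger $m$, a QR factorization is the more robust route.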

What if $\epsilon_i \sim$ Normal$(0, \sigma_i^2)$ with different $\sigma_i$’s? Then the likelihood function is

$$L(c) = (2\pi)^{-n/2} (\det D)^{-1} \exp(-\tfrac{1}{2}\, \epsilon^T D^{-2} \epsilon)$$

where $D = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$. This leads to a weighted least squares problem,

$$\min_c\, (\Phi c - y)^T D^{-2} (\Phi c - y),$$

which can also be solved by normal equations $\Phi^T D^{-2} \Phi c = \Phi^T D^{-2} y$, or the QR factorization of $D^{-1} \Phi$.
There are also nonlinear regression problems, such as mixed exponentials with unknown rates:

$$y_i = a_1 \exp(-b_1 t_i) + a_2 \exp(-b_2 t_i) + \epsilon_i,$$

with $\epsilon_i \sim$ Normal$(0, \sigma^2)$. If we apply maximum likelihood methods to this, we want to minimize $L(a, b) := \sum_{i=1}^{N} (y_i - (a_1 \exp(-b_1 t_i) + a_2 \exp(-b_2 t_i)))^2$ with respect to $a_1, a_2, b_1, b_2$. Note that the loss function $L(a, b)$ is invariant under swapping the parameters for the two terms $(a_1, b_1) \leftrightarrow (a_2, b_2)$. Since we do not expect that the solution will be symmetric ($(a_1, b_1) = (a_2, b_2)$), swapping the two terms of one minimizer gives a second, distinct minimizer. This means that $L(a, b)$ is in general not convex and can have multiple local minimizers. Standard optimization algorithms can have difficulties in these situations.


7.3.3 Hypothesis Testing 


Data are used to test hypotheses. There are two main ways of doing this. One is to 
use Bayes’ theorem for conditional probability. The other is to use “p-values” for 
checking the plausibility of a basic hypothesis. 


7.3.3.1 Bayesian Inference 


Given a data set $D$, how likely is a hypothesis $H$? This is the conditional probability $\Pr[H \mid D]$. But this is usually very difficult to compute directly. Instead, it is much easier to compute $\Pr[D \mid H]$ as the hypothesis $H$ is a statement about the nature of the data $D$. From Bayes’ theorem,

$$\Pr[H \mid D] = \frac{\Pr[D \,\&\, H]}{\Pr[D]} = \frac{\Pr[D \mid H]\, \Pr[H]}{\sum_{H'} \Pr[D \mid H']\, \Pr[H']}$$

where $H'$ ranges over all plausible hypotheses. The value $\Pr[H]$ is the probability that hypothesis $H$ is true before we have any data about it. This probability $\Pr[H]$ is the a priori probability, while $\Pr[H \mid D]$ is the a posteriori probability of hypothesis $H$.

Estimating $\Pr[H]$ is often a subjective matter. Consider the question, “What is the probability that the sun will rise every 24 hours?” We might have personally observed this occurring tens of thousands of times and have historical records of it occurring hundreds of thousands of times before that. But what should we assign to this hypothesis before we have evidence, or before we know anything about the sun? Without evidence we have very little basis for any computation of $\Pr[H]$. We can make an arbitrary assignment $\Pr[H] = \frac{1}{2}$. The observed data of centuries of observing the sun rise every day would then give $\Pr[H \mid D]$ very close to one. On the other hand, assuming instead a tiny a priori probability such as $\Pr[H] = 10^{-k}$ with $k$ large, $\Pr[H \mid D]$ might not be close to one even with centuries of observations.

There is a way of avoiding making assumptions about the a priori probability, and that is to look instead at the odds ratio $\Pr[H]/\Pr[\text{not } H]$: Note that

$$\frac{\Pr[H \mid D]}{\Pr[\text{not } H \mid D]} = \frac{\Pr[D \mid H]}{\Pr[D \mid \text{not } H]} \cdot \frac{\Pr[H]}{\Pr[\text{not } H]}.$$

We are not looking for the absolute value of $\Pr[H \mid D]$ or even $\Pr[H \mid D]/\Pr[\text{not } H \mid D]$. Rather we see how the evidence affects the odds ratio. If the odds ratio

(7.3.5)    $\Pr[D \mid H]/\Pr[D \mid \text{not } H]$

is large, then we have strong evidence for $H$ over not $H$.

If we have continuous data, using odds ratios can be a much more effective technique: $\Pr[D \mid H]$ is typically zero if the items in the data set $D$ are sampled from a continuous probability distribution. The ratio $\Pr[D \mid H]/\Pr[D \mid \text{not } H]$ would then be “0/0” and undefined. However, if we consider data sets $D'$ generated according to the hypotheses $H$ and not $H$, and $D$ as the given (fixed) data set, we can replace $\Pr[D \mid H]/\Pr[D \mid \text{not } H]$ with

(7.3.6)    $\lim_{\epsilon \downarrow 0} \frac{\Pr[\,\|D' - D\| < \epsilon \mid H\,]}{\Pr[\,\|D' - D\| < \epsilon \mid \text{not } H\,]} = \frac{p_{D \mid H}(D)}{p_{D \mid \text{not } H}(D)}$

where $p_{D \mid H}$ and $p_{D \mid \text{not } H}$ are the probability density functions for the data set $D$ given $H$ and not $H$, respectively. The appropriate odds ratio for continuous data is $p_{D \mid H}(D)/p_{D \mid \text{not } H}(D)$.

Suppose we take $N$ independent samples $x_i$, $i = 1, 2, \ldots, N$. We wish to determine whether the samples come from either a probability distribution with density function $p_1(x)$ (hypothesis $H$) or with density function $p_2(x)$ (hypothesis not $H$). Then $p_{D \mid H}(x_1, x_2, \ldots, x_N) = \prod_{i=1}^{N} p_1(x_i)$ and $p_{D \mid \text{not } H}(x_1, x_2, \ldots, x_N) = \prod_{i=1}^{N} p_2(x_i)$. Thus

$$\frac{\Pr[H \mid D]}{\Pr[\text{not } H \mid D]} = \left( \prod_{i=1}^{N} \frac{p_1(x_i)}{p_2(x_i)} \right) \frac{\Pr[H]}{\Pr[\text{not } H]}.$$

The odds ratio for hypothesis $H$ is improved by a factor of $\prod_{i=1}^{N} (p_1(x_i)/p_2(x_i))$.
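In practice the product of density ratios is best accumulated as a sum of logarithms to avoid underflow. The following sketch, with illustrative function names, updates prior odds using two normal densities as the competing hypotheses.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Logarithm of the Normal(mu, sigma^2) density at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)


def log_likelihood_ratio(samples, logp1, logp2):
    """Sum over samples of ln p1(x_i) - ln p2(x_i)."""
    return sum(logp1(x) - logp2(x) for x in samples)


def posterior_odds(prior_odds, samples, logp1, logp2):
    """Prior odds multiplied by the product of p1(x_i)/p2(x_i)."""
    return prior_odds * math.exp(log_likelihood_ratio(samples, logp1, logp2))


h = lambda x: normal_logpdf(x, 0.0, 1.0)        # hypothesis H: Normal(0, 1)
not_h = lambda x: normal_logpdf(x, 1.0, 1.0)    # hypothesis not H: Normal(1, 1)
odds = posterior_odds(1.0, [0.0, 0.0, 0.0], h, not_h)
```

For these two densities each sample contributes $\frac{1}{2} - x_i$ to the log ratio, so three samples at zero multiply the odds by $e^{1.5}$.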


7.3.3.2 Hypothesis Testing Using p-values 


The idea here is to compute a statistic $S$ from the data set $D$, and see how consistent the value of this statistic is with a hypothesis to be tested called the null hypothesis $H_0$. If the computed value of the statistic $S$ for data set $D$ is $s$, then we test if

(7.3.7)    $\Pr[S \geq s \mid H_0] < p$

where $0 < p < 1$ is a pre-specified p-value. Typically, $p$ is taken to be 0.05 or 0.01, representing testing at the 5% and 1% levels, respectively. If $\Pr[S \geq s \mid H_0] < p$ then we reject $H_0$, the null hypothesis. The idea is that under the null hypothesis, the chance of seeing a value of the statistic $S$ as extreme as the observed value $s$ is less than $p$.

For example, suppose a number of values $x_1, x_2, \ldots, x_N$ are measured. Assume that the null hypothesis $H_0$ is that each $x_i \sim$ Normal$(0, \sigma^2)$ independently with $\sigma^2$ given. If the statistic $S$ is the mean of the values, $S = (1/N) \sum_{i=1}^{N} x_i$, then under the null hypothesis, $S \sim$ Normal$(0, \sigma^2/N)$. If $\Pr[S \geq s \mid H_0] = \int_s^\infty (2\pi\sigma^2/N)^{-1/2} \exp(-N z^2/(2\sigma^2))\,dz < p$ then we reject the null hypothesis.
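The tail integral in this example can be evaluated with the complementary error function, since $\Pr[Z \geq z] = \frac{1}{2}\, \mathrm{erfc}(z/\sqrt{2})$ for a standard normal $Z$. A minimal sketch, with an illustrative function name:

```python
import math


def upper_tail_p_value(sample, sigma):
    """Pr[S >= s | H0] for the sample mean S under x_i ~ Normal(0, sigma^2)."""
    n = len(sample)
    s = sum(sample) / n
    z = s / (sigma / math.sqrt(n))          # standardize: S ~ Normal(0, sigma^2/N)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


reject_at_5_percent = upper_tail_p_value([0.9, 1.4, 0.7, 1.1], sigma=1.0) < 0.05
```

In the usage line the sample mean is 1.025 with $N = 4$, so the standardized value is 2.05 and the tail probability is about 0.02, below the 5% threshold.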

While this is a standard method for identifying when “something interesting” is 
happening, there are a number of ways in which this approach can fail. This is partic- 
ularly true where there are multiple tests or tests of multiple hypotheses. Situations 
like this often arise with, for example, high-throughput testing. For example, genetic 
markers can be tested for connection with, say, cancer likelihood. The null hypothe- 
sis for a given genetic marker would be that the genetic marker has no effect on the 
cancer likelihood. If we have an estimate $p_0$ of the “background” probability of a
certain cancer occurring, the null hypothesis would be that a person with the specified
genetic marker has probability $p_0$ of having cancer. Under the null hypothesis, the
number of people in a random sample of $N$ people with the genetic marker that also
have cancer is a random variable with the Binomial($N$, $p_0$) distribution. The statistic
$S$ used would be the number of people in the sample with cancer. Then

$$\Pr[S \ge k] = \sum_{j=k}^{N} \binom{N}{j}\, p_0^{\,j}\, (1 - p_0)^{N-j}.$$

If $\Pr[S \ge k] < p$ for a threshold p-value, then the null hypothesis would be rejected:
the genetic marker seems to be positively correlated with cancer.
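The binomial tail probability can be evaluated directly for moderate $N$. The sketch below uses `math.comb` from the standard library; the function name `binomial_tail` and the numbers in the example are assumptions for illustration.

```python
import math

def binomial_tail(N, k, p0):
    """Pr[S >= k] for S ~ Binomial(N, p0)."""
    return sum(math.comb(N, j) * p0 ** j * (1.0 - p0) ** (N - j)
               for j in range(k, N + 1))

# Hypothetical study: background rate 2%, 200 carriers of the marker, 9 with cancer.
tail = binomial_tail(200, 9, 0.02)
print(tail, "reject H0" if tail < 0.05 else "do not reject H0")
```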

With high-throughput testing, multiple genetic markers are tested simultaneously.
If there are $m$ distinct genetic markers tested, then (assuming independence of the
genetic markers) the probability that some genetic marker is deemed significant
for cancer for the given value of $p \ll 1$ is $1 - (1 - p)^m \approx 1 - \exp(-mp)$. If $mp$ is
not small, then there is a significant probability of a “positive” result even if the null
hypothesis holds for all tested genetic markers. Positivity bias is also important: one 
study might look at m potential genetic markers. But different groups and laboratories 
might also be looking at other genetic markers. If there are M groups, each studying 
m genetic markers, the probability that someone will find a “positive” result is fairly 
high if Mmp is not small even if the null hypothesis is true. Even if each group is 
careful to make sure that mp is small, the group that finds a “positive” result would 
not know about all the other M — 1 groups with the negative results: the groups with 
negative results usually do not publish their results, but the group with the positive 
result would. 
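To get a feel for the size of this effect, the probability of at least one spurious “positive” can be tabulated. This is a small sketch with illustrative values of $m$ and $p$; the function names are not from the text.

```python
import math

def prob_some_positive(m, p):
    """Probability that at least one of m independent tests at level p is positive."""
    return 1.0 - (1.0 - p) ** m

def prob_some_positive_approx(m, p):
    """The approximation 1 - exp(-m p), valid for small p."""
    return 1.0 - math.exp(-m * p)

for m in (1, 20, 100, 1000):
    print(m, prob_some_positive(m, 0.05), prob_some_positive_approx(m, 0.05))
```

With $p = 0.05$, twenty independent tests already give a chance of roughly 64% of some “positive” result under the null hypothesis.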

Bayesian inference is not immune to these problems. If enough studies are done,
randomness will ensure that spurious “positive” results will eventually occur whether 
Bayesian inference or p-value style hypothesis testing is used. Independent confir- 
mation should be performed, however the original results are obtained. 


Exercises. 


(1) Given that a collection of random variables $X_i \sim \text{Poisson}(\lambda)$, $i = 1, 2, \ldots, N$,
are independent, show that the MLE for $\lambda$ is $\hat\lambda_{MLE} = (\sum_{i=1}^N X_i)/N$.

(2) Suppose that $Y_i = \mu + \epsilon_i$, $i = 1, 2, \ldots, N$, where the $\epsilon_i$ are independent random
variables with $\epsilon_i \sim \text{Normal}(0, \sigma^2)$. Show that the MLE's for $\mu$ and $\sigma^2$ minimize
$(1/(2\sigma^2)) \sum_{i=1}^N (Y_i - \mu)^2 + N \ln \sigma$. Use this to show that the MLE estimators
are $\hat\mu_{MLE} = (1/N) \sum_{i=1}^N Y_i$ and $\hat\sigma^2_{MLE} = (1/N) \sum_{i=1}^N (Y_i - \hat\mu_{MLE})^2$.

(3) From the previous Exercise, show that $E[\hat\sigma^2_{MLE}] = (1 - 1/N)\, \sigma^2$ so that $\hat\sigma^2_{MLE}$
is not an unbiased estimator of $\sigma^2$. However, note that it is an asymptotically
unbiased estimator as $N \to \infty$.

(4) △ A radioactive tracer is used to identify how much of a certain molecule is
within an organ of the body. A Geiger counter is used to estimate this over
periods of time. The rate at which the Geiger counter “counts” is $p(t) = a\, e^{-bt}$,
which decays exponentially as the traced molecule leaves the relevant organ. This
means that the total count $C_i$ for a time interval $[t_i, t_{i+1}]$ has a Poisson distribution
$C_i \sim \text{Poisson}(\lambda_i)$ where $\lambda_i = \int_{t_i}^{t_{i+1}} a\, e^{-bt}\, dt$. Note that counts $C_i$ and $C_j$ with
$i \ne j$ are assumed to be independent. From this, develop an MLE for
$a$ and $b$ in terms of the counts $C_i$ assuming that $t_i = ih$, $i = 0, 1, 2, \ldots, N - 1$,
where $h$ is the spacing between Geiger counter readings.


(5) △ Another approach to the problem in the previous Exercise of estimating $(a, b)$
for an exponential decay process is to take logarithms. First note that
$\lambda_i = (a/b)[e^{-b t_i} - e^{-b t_{i+1}}] \approx (a/b)(bh)\, e^{-b t_i} = ah\, e^{-b t_i}$. Taking logarithms
gives $\ln \lambda_i \approx \ln(ah) - b t_i$. Using $\ln C_i$ as our estimates for $\ln \lambda_i$ we could use a
linear fit of $\ln C_i$ against $t_i = ih$, and we can use simple least squares for this.
However, as $b t_i$ becomes large, $e^{-b t_i}$ and $\lambda_i$ become small and the variance
$\text{Var}[\ln C_i]$ can become large. This can be (partly) compensated for by using a weighted
least squares linear fit where the data points with small values of $C_i$ are weighted
less than larger values of $C_i$. (Data points with $C_i = 0$ should probably be removed.)
Develop a method that involves minimizing $\sum_i w_i\, (\ln(C_i) - \ln(ah) + b t_i)^2$ where $w_i$
is roughly proportional to $1/\text{Var}[\lambda_i]$ as estimated from $C_i$.

(6) A cancer test has a false positive probability of 0.1% and a false negative
probability of 0.5%. The cancers detected by the test occur in about 2% of the population.
If a random person taking the test gets a positive result, what is the chance that
the person actually has one of the cancers tested for? If a random person taking
the test gets a negative result, what is the chance that the person does not have
any of the detectable cancers?

(7) A gambler suspects that a tossed coin is biased to come down heads 55% of
the time instead of 50%. How many coin tosses would the gambler be likely to
need in order to affirm with probability $p < 1\%$ that the coin is unbiased if the coin is,
indeed, biased as suspected? [Hint: Assume that the biased coin comes down heads
for exactly (or nearly exactly) 55% of tosses.]


7.4 Random Algorithms 


7.4.1 Random Choices 


Many practical algorithms use random choices. Here we will look at three algorithms
that use random choices: quicksort, primality testing, and estimating $\pi$. The reasons
for the random choices differ according to the algorithm. In quicksort, the random
choices mean that the average case performance occurs with high probability. In
primality testing, a test is applied to the random choices made. If the test for any
specific choice is negative, then there is no need for any other test. For estimating
$\pi$, many random choices are made, and averaging is used to obtain a more accurate
estimate.

The $\pi$ estimation algorithm is called a Monte Carlo algorithm. Monte Carlo
algorithms are inherently random; at no point does a Monte Carlo algorithm have a
definite answer. For estimating $\pi$, we have only an approximate value of $\pi$ at any stage
of the algorithm. On the other hand, the primality testing algorithm can stop with a
definite result (“not prime”) once a single test is negative, although passing every
test only shows that the number is probably prime. In the usual terminology this makes
it a Monte Carlo algorithm with one-sided error. The results of the quicksort algorithm,
however, do not depend on the random choices: the result is the input list, but sorted.
Random choices do not affect the final result, just the speed of obtaining it; an
algorithm of this kind is usually called a Las Vegas algorithm.


7.4.1.1 Quicksort 


One example is the quicksort algorithm [59, Ch. 8] that first selects an element (called
the pivot) of a list, and then splits the rest of the list into two sublists consisting of
elements less than the pivot, and elements greater than the pivot. Each sublist is then
recursively sorted. If we choose a fixed element as the pivot element, then for some
input lists of length $n$ the method takes $O(n^2)$ comparisons to sort the list, while for
most input lists the method takes $O(n \log n)$ comparisons. If instead, we choose the
pivot entry randomly and uniformly from the list to sort, the method takes $O(n \log n)$
comparisons with a very high probability for large $n$.
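A short sketch of quicksort with a uniformly random pivot, instrumented to count comparisons against the pivot. It builds new lists rather than sorting in place, which is a simplification chosen for readability; the function name `quicksort_count` is illustrative.

```python
import random

def quicksort_count(items, rng):
    """Return (sorted list, number of comparisons against pivots)."""
    if len(items) <= 1:
        return list(items), 0
    pivot_index = rng.randrange(len(items))      # uniform random pivot
    pivot = items[pivot_index]
    rest = items[:pivot_index] + items[pivot_index + 1:]
    smaller, larger = [], []
    for x in rest:                               # one comparison per remaining element
        (smaller if x < pivot else larger).append(x)
    left, c_left = quicksort_count(smaller, rng)
    right, c_right = quicksort_count(larger, rng)
    return left + [pivot] + right, c_left + c_right + len(rest)

rng = random.Random(0)
already_sorted = list(range(1000))               # a bad input for a fixed first-element pivot
result, comparisons = quicksort_count(already_sorted, rng)
print(comparisons, "comparisons; worst case would be", 1000 * 999 // 2)
```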


7.4.1.2 Primality Testing 


Another randomized algorithm is a well-known primality testing algorithm. If an odd
number $n$ is prime then for every $1 < x < n - 1$, $x^{(n-1)/2} \equiv (x \mid n) \pmod n$ where
$(x \mid z)$ is the Legendre symbol (for odd composite $z$ this is the Jacobi symbol). The
Legendre symbol has a number of properties, the most important of which are

• for $n$ prime, if $(x \mid n) = +1$ then $x \equiv y^2 \pmod n$ for some $y$, and if $(x \mid n) = -1$
  there is no such $y$,
• for $n$ prime, $(x \mid n) = \pm 1$ for any $x \not\equiv 0 \pmod n$,
• $(x \mid n) = (z \mid n)$ whenever $x \equiv z \pmod n$, and
• $(x \mid z) = (-1)^{(x-1)(z-1)/4}\, (z \mid x)$ for odd positive $x$ and $z$.

If $n$ is not prime, then for at least half of the integers $x$ between one and $n - 1$,
$x^{(n-1)/2} \not\equiv (x \mid n) \pmod n$. But we do not know which $x$ have this property. The
randomization step is to randomly choose $x$ in the range one to $n - 1$. The value of
$x^{(n-1)/2}$ modulo $n$ can be computed in $O(\log n)$ multiplications modulo $n$. The value of
$(x \mid n)$ can be computed in $O(\log n)$ arithmetic operations using the above properties of
the Legendre symbol. So if $n$ is not prime, with probability at most 1/2,
$x^{(n-1)/2} \equiv (x \mid n) \pmod n$ and with probability at least 1/2,
$x^{(n-1)/2} \not\equiv (x \mid n) \pmod n$. The randomized algorithm repeatedly picks
$x \in \{1, 2, \ldots, n - 1\}$ and checks if $x^{(n-1)/2} \equiv (x \mid n) \pmod n$. This is repeated up
to $m$ times, but terminated immediately if $x^{(n-1)/2} \not\equiv (x \mid n) \pmod n$ as this
indicates that $n$ is not prime. If all of the randomly chosen $x$ we have used satisfy
$x^{(n-1)/2} \equiv (x \mid n) \pmod n$, we can claim that $n$ is probably prime: a composite $n$
passes all $m$ tests with probability at most $2^{-m}$.
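This test is usually known as the Solovay–Strassen test. The following is a sketch rather than a production implementation: the function names are illustrative, the symbol is computed by the standard reciprocity-based loop (which also uses the rule for $(2 \mid n)$, a property the list above does not state), and a symbol value of zero is treated as proof that $n$ is composite.

```python
import random

def jacobi(x, n):
    """Jacobi symbol (x | n) for odd n > 0; equals the Legendre symbol when n is prime."""
    x %= n
    result = 1
    while x != 0:
        while x % 2 == 0:                 # remove factors of 2 using the rule for (2 | n)
            x //= 2
            if n % 8 in (3, 5):
                result = -result
        x, n = n, x                       # reciprocity: swap, with a possible sign change
        if x % 4 == 3 and n % 4 == 3:
            result = -result
        x %= n
    return result if n == 1 else 0

def probably_prime(n, m, rng):
    """Return False if n is certainly composite, True if n passed m random tests."""
    if n < 2:
        return False
    if n in (2, 3):
        return True
    if n % 2 == 0:
        return False
    for _ in range(m):
        x = rng.randrange(2, n - 1)       # 1 < x < n - 1
        j = jacobi(x, n)
        if j == 0 or pow(x, (n - 1) // 2, n) != j % n:
            return False                  # a single failed test settles the question
    return True

rng = random.Random(0)
print(probably_prime(7919, 20, rng), probably_prime(561, 20, rng))
```

The built-in three-argument `pow` performs the modular exponentiation by repeated squaring, which is the $O(\log n)$ multiplication count mentioned above.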


7.4.1.3 Buffon Needle Problem 


In 1733, the Comte de Buffon (also known as Georges-Louis Leclerc) posed the
problem of determining the probability of a needle of length $\ell$ crossing a set of
parallel lines with common spacing $s$ [36]. This probability can be computed to be
$2\ell/(\pi s)$ for $\ell \le s$. Performing this experiment with many “needles” and recording
the fraction of the needles that cross one of the parallel lines can give approximate
values for $\pi$. This was supposedly done by the Italian mathematician Mario Lazzarini
dropping a needle 3408 times to obtain the value 355/113, which is correct to six
digits [15]. To say that Lazzarini was rather lucky to get such an accurate value of $\pi$ is
something of an understatement. To expect this kind of accuracy would require more
like $10^{12}$ needle drops than a few thousand. However, the method can be implemented
in software as a Monte Carlo method to estimate $\pi$.
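One possible software version is sketched below. By symmetry only the distance from the needle's centre to the nearest line and the acute angle between needle and lines matter. To avoid using the value of $\pi$ when sampling the angle, the sine of a uniformly distributed angle is obtained by rejection sampling in a quarter disc; this sampling choice and the function names are assumptions of the sketch.

```python
import math
import random

def random_sine(rng):
    """sin(theta) for theta uniform on [0, pi/2], sampled without using pi."""
    while True:
        u, v = rng.random(), rng.random()
        r2 = u * u + v * v
        if 0.0 < r2 <= 1.0:               # keep points inside the quarter disc
            return v / math.sqrt(r2)

def buffon_pi(num_needles, length, spacing, rng):
    """Estimate pi from the fraction of needles crossing a line (length <= spacing)."""
    crossings = 0
    for _ in range(num_needles):
        centre = rng.uniform(0.0, spacing / 2.0)   # distance from centre to nearest line
        if centre <= 0.5 * length * random_sine(rng):
            crossings += 1
    if crossings == 0:
        raise ValueError("no crossings observed; use more needles")
    # crossing probability is 2*length/(pi*spacing), so solve for pi
    return 2.0 * length * num_needles / (spacing * crossings)

print(buffon_pi(100_000, 1.0, 1.0, random.Random(0)))
```

The error behaves like $N^{-1/2}$, so each additional correct digit costs about a hundred times more needle drops.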


7.4.2 Monte Carlo Algorithms and Markov Chains 


Markov chains are examples of stochastic processes, random processes where the
state of the process changes over time. Monte Carlo algorithms use randomness to
estimate deterministic quantities, usually of the form $E[f(X)]$ where $X$ is a random
variable with a prescribed probability distribution. Markov chain Monte Carlo
(MCMC) methods use Markov chains to generate samples from a target probability
distribution. However, the successive states of a Markov chain are usually not
independent. Thus, a Monte Carlo method based on a Markov chain may need to either
use widely spaced samples from the Markov chain or compensate in some other way
for the lack of independence.


7.4.2.1 Markov Chains


A discrete time Markov chain consists of a set of possible states $S$ and a sequence of
random variables $X_0, X_1, X_2, \ldots$ with the property that for any measurable subset
$E$ of $S$,

(7.4.1)    $\Pr[X_{t+1} \in E \mid X_0, X_1, \ldots, X_t] = \Pr[X_{t+1} \in E \mid X_t]$.

That is, the probability distribution of $X_{t+1}$ depends only on the value of $X_t$, and not
on any previous values $X_{t-k}$, $k = 1, 2, \ldots$. The value of $X_t$ is called the state of the
Markov chain.

If $S$ is also a finite set, then a discrete time Markov chain can be represented by a
fixed matrix $P$ where $p_{ij} = \Pr[X_{t+1} = i \mid X_t = j]$ is the transition probability for
the transition $j \to i$. Now

$$\Pr[X_{t+1} = i] = \sum_{j \in S} \Pr[X_{t+1} = i \mid X_t = j]\, \Pr[X_t = j].$$

If $\pi_{t,i} = \Pr[X_t = i]$ for any $t$ and $\pi_t = [\pi_{t,i} \mid i \in S]$ is the vector of probabilities
of $X_t$, then

$$\pi_{t+1} = P \pi_t;$$

$P$ is the transition matrix for the Markov chain. If $e$ is the vector of ones of the
dimension of $\pi_t$, then since the total probability is one, $e^T \pi_t = 1$ for all $t$. In
particular,

$$1 = e^T \pi_{t+1} = e^T P \pi_t$$

for any probability vector $\pi_t$; thus $e^T P = e^T$. The other property of $P$ is that each
entry of $P$ is non-negative since all probabilities are non-negative. Any square matrix
$P$ of non-negative entries where $e^T P = e^T$ is called a stochastic matrix, and can be
the matrix of a Markov chain. A stochastic matrix $P$ where also $P e = e$ is called
doubly stochastic.
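A small sketch of these definitions in plain Python, using the column convention of the text ($p_{ij}$ is the probability of the transition $j \to i$, so columns sum to one). The two-state matrix and the helper names are made up for the illustration.

```python
def is_stochastic(P, tol=1e-12):
    """Check that P has non-negative entries and that every column sums to one."""
    n = len(P)
    nonneg = all(P[i][j] >= 0.0 for i in range(n) for j in range(n))
    cols = all(abs(sum(P[i][j] for i in range(n)) - 1.0) <= tol for j in range(n))
    return nonneg and cols

def step(P, pi):
    """One step pi_{t+1} = P pi_t of the probability vector."""
    return [sum(P[i][j] * pi[j] for j in range(len(pi))) for i in range(len(P))]

P = [[0.9, 0.5],
     [0.1, 0.5]]
pi = [1.0, 0.0]                 # start in the first state with probability one
for _ in range(50):
    pi = step(P, pi)
print(is_stochastic(P), pi)     # pi approaches the equilibrium vector [5/6, 1/6]
```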

In general, we have four possible combinations of continuous and discrete Markov
chains: discrete time and discrete state, discrete time and continuous state,
continuous time and discrete state, and also continuous time and continuous state.
Continuous time and continuous state problems are best understood as stochastic
differential equations. Continuous time and discrete state Markov chains are linear
constant coefficient systems of differential equations:

$$\frac{d\pi}{dt} = A \pi$$

where $a_{ii} \le 0$, $a_{ij} \ge 0$ if $i \ne j$, and $e^T A = 0^T$. The matrix $A$ is called the transition
rate matrix for the Markov chain. Discrete time and continuous state can often be
represented in terms of integrals for probability distribution functions:

$$\pi_{t+1}(x) = \int_S p(x, y)\, \pi_t(y)\, dy,$$

with the condition that $\int_S p(x, y)\, dx = 1$ for all $y \in S$. More generally, probability
distributions should be represented by measures, in which case the formula should
be given by

$$\pi_{t+1}(E) = \int_S \mu(E, y)\, \pi_t(dy) \qquad \text{for all measurable } E \subseteq S,$$

where $\mu$ is a function from $S$ to probability measures on $S$ with the property that
$\mu(S, y) = 1$ for all $y \in S$. We call $\mu$ the transition measure.

Algorithm 70 Simulating a discrete time & discrete state Markov chain
    generator markovdiscrete($x_0$, $P$, $U$)
        $x \leftarrow x_0$
        while true
            $u \leftarrow$ next($U$)
            find $y$: $\sum_{z < y} p_{zx} \le u < \sum_{z \le y} p_{zx}$
            $x \leftarrow y$
            yield $x$
        end while
    end generator


7.4.2.2 Simulating Markov Chains 


Computing the probability distributions $\pi_t$ gives a great deal of information about the
Markov chain, but this is often impractical if the state space is large. For example, if $S$
is a discrete state space, then for a discrete time Markov chain with transition matrix
$P$ we can simulate the Markov chain as follows: given $X_t = j$, we sample $X_{t+1}$
from the distribution where $X_{t+1} = i$ with probability $p_{ij}$. For $S = \{1, 2, \ldots, N\}$,
we can implement this method in Algorithm 70, where $U$ is a generator that uniformly
samples from $[0, 1]$.
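Algorithm 70 translates almost line for line into a Python generator. In this sketch the states are numbered $0, 1, \ldots, N-1$ rather than $1, \ldots, N$, `P[i][j]` holds $p_{ij}$, and `uniform_stream` is a hypothetical helper that plays the role of $U$.

```python
import itertools
import random

def uniform_stream(rng):
    """Endless stream of Uniform[0, 1) samples, playing the role of U."""
    while True:
        yield rng.random()

def markov_discrete(x0, P, uniform):
    """Yield X_1, X_2, ... where P[i][j] = Pr[X_{t+1} = i | X_t = j]."""
    x = x0
    while True:
        u = next(uniform)
        cumulative = 0.0
        y = len(P) - 1                 # fallback in case roundoff leaves the column sum below u
        for i in range(len(P)):
            cumulative += P[i][x]      # running sum down column x
            if u < cumulative:
                y = i
                break
        x = y
        yield x

P = [[0.9, 0.5],
     [0.1, 0.5]]
chain = markov_discrete(0, P, uniform_stream(random.Random(0)))
print(list(itertools.islice(chain, 20)))
```

Because the result is a generator, as many states as needed can be drawn lazily with `itertools.islice` or a `for` loop.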

Continuous time but discrete state Markov chains with transition rate matrix $A$
can be simulated in different ways. One is to pick a step size $h > 0$ and then set
$P_h = (I - hA)^{-1}$. The column sums of $P_h$ are all one, as

$$e^T = e^T (I - hA) P_h = (e^T - h\, e^T A) P_h = (e^T - h\, 0^T) P_h = e^T P_h.$$

The entries of $P_h$ are also non-negative, since $I - hA$ has positive diagonal entries,
non-positive off-diagonal entries, and is strictly diagonally dominant by columns, so
$P_h$ is a stochastic matrix. We can then apply Algorithm 70 using $P = P_h$ to generate
$X_{kh}$, $k = 1, 2, 3, \ldots$. This approach is closely related to the implicit Euler method
for differential equations (see 6.1.8). An alternative is inspired by the explicit Euler
method and sets $P = I + hA$. In order for $P$ to be a stochastic matrix, we need
$1 + h\, a_{ii} \ge 0$ for all $i$. Since $a_{ii} \le 0$, this puts an upper limit on the value of $h$.
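The step size restriction for the explicit Euler variant is easy to check numerically. The sketch below uses a made-up $2 \times 2$ rate matrix whose columns sum to zero; `explicit_euler_matrix` and `max_step` are illustrative names.

```python
def explicit_euler_matrix(A, h):
    """Return P = I + h A as a list of rows."""
    n = len(A)
    return [[(1.0 if i == j else 0.0) + h * A[i][j] for j in range(n)]
            for i in range(n)]

def max_step(A):
    """Largest h with 1 + h a_ii >= 0 for all i (infinite if every a_ii is zero)."""
    rates = [-A[i][i] for i in range(len(A)) if A[i][i] < 0.0]
    return min(1.0 / r for r in rates) if rates else float("inf")

A = [[-2.0, 1.0],
     [2.0, -1.0]]                          # hypothetical rate matrix, columns sum to zero
print(max_step(A))                         # 0.5
print(explicit_euler_matrix(A, 0.25))      # non-negative entries, columns sum to one
print(explicit_euler_matrix(A, 1.0))       # step too large: a negative "probability" appears
```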

Another approach is to identify the transition times: given $X_t = j$, the
transition time $\tau$ is the smallest $\tau > 0$ where $X_{t+s} = j$ for all $0 \le s < \tau$, but there
are arbitrarily small $\epsilon > 0$ where $X_{t+\tau+\epsilon} \ne j$. The transition time $\tau$ is a random
variable and is distributed according to the exponential distribution with parameter
$\lambda = -a_{jj}$: $\tau \sim \text{Exponential}(-a_{jj})$. Note that $a_{jj} \le 0$ so $\lambda \ge 0$. Then if $0 < r < s$,
$\Pr[r < \tau < s] = \exp(-\lambda r) - \exp(-\lambda s)$. We can sample from this distribution by
setting $\tau \leftarrow -\ln(U)/\lambda$ where $U$ is a random variable uniformly distributed over
$[0, 1]$. The state $X_{t+\tau^+}$ is then sampled from $S$ with
$\Pr[X_{t+\tau^+} = i] = a_{ij} / \sum_{k \ne j} a_{kj}$ for $i \ne j$. All of these samples can be made
independently. If $a_{jj} = 0$ then $\tau = +\infty$ and the simulation stops. In this case, the
state $j$ is an absorbing state and no transition out of this state is possible.
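A sketch of this transition-time simulation follows. It assumes the column convention of the text ($a_{ij}$ is the rate of the transition $j \to i$ and columns of $A$ sum to zero, so $\sum_{k \ne j} a_{kj} = -a_{jj}$); the function name and the example matrix are illustrative.

```python
import math
import random

def simulate_ctmc(A, j0, t_end, rng):
    """Return the list of (time, state) pairs visited up to time t_end."""
    t, j = 0.0, j0
    path = [(t, j)]
    while True:
        lam = -A[j][j]
        if lam <= 0.0:                       # absorbing state: tau = +infinity
            return path
        tau = -math.log(1.0 - rng.random()) / lam   # 1 - U lies in (0, 1]
        if t + tau > t_end:
            return path
        t += tau
        u = rng.random() * lam               # lam equals the sum of a_kj over k != j
        cumulative, nxt = 0.0, None
        for i in range(len(A)):
            if i != j and A[i][j] > 0.0:
                nxt = i                      # last candidate is kept if roundoff intervenes
                cumulative += A[i][j]
                if u < cumulative:
                    break
        j = nxt
        path.append((t, j))

A = [[-1.0, 0.0, 2.0],
     [1.0, -3.0, 0.0],
     [0.0, 3.0, -2.0]]                       # hypothetical cyclic chain 0 -> 1 -> 2 -> 0
print(simulate_ctmc(A, 0, 5.0, random.Random(0)))
```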


7.4.2.3 Metropolis–Hastings Algorithm


The Metropolis–Hastings algorithm gives iterates $x_k$ of a Markov chain over a
discrete state space $X$. The inputs to the Metropolis–Hastings algorithm are a function
$q\colon X \to \mathbb{R}$ with positive values, and a probability distribution function $g(y \mid x)$. This
Markov chain has the property that $\Pr(x_k = x) \to q(x) / \sum_{z \in X} q(z)$ as $k \to \infty$, provided
the Markov chain is ergodic. A discrete Markov chain is ergodic if any simulation $X_t$
of the Markov chain has $\lim_{t \to \infty} \Pr[X_t = x] > 0$ and this probability is independent of
the simulation. This algorithm is shown in Algorithm 71. The original Metropolis
algorithm assumed that $g$ is symmetric:

(7.4.2)    $g(x \mid y) = g(y \mid x)$  for all $x, y \in X$.

Note that $q$ defines the limiting probability distribution for the iterates. However,
the iterates $x_j$ and $x_k$ for $j \ne k$ are not independent. They are, in a sense,
asymptotically independent in that $E[\varphi(x_j)\, \psi(x_k)] - E[\varphi(x_j)]\, E[\psi(x_k)] \to 0$ as
$|j - k| \to \infty$ for any $\varphi$ and $\psi\colon X \to \mathbb{R}$.

By suitably re-interpreting $q$ and $g$ and the formulas involving them, the algorithm
can be extended to continuous as well as discrete probability distributions.

Algorithm 71 Metropolis–Hastings algorithm; $U$ is a generator of independent random
numbers with distribution Uniform(0, 1).
    generator metropolishastings($q$, $g$, $x_0$, $n$, $U$)
        for $k = 0, 1, 2, \ldots, n - 1$
            sample $x'$ from $g(x' \mid x_k)$
            if next($U$) < min(1, $q(x')\, g(x_k \mid x') / (q(x_k)\, g(x' \mid x_k))$)
                $x_{k+1} \leftarrow x'$
            else
                $x_{k+1} \leftarrow x_k$
            end if
            yield $x_{k+1}$
        end for
    end generator
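A Python rendering of Algorithm 71 might look as follows. The split of the proposal into a sampling function `sample_g` and a probability function `g`, the four-state target weights, and the uniform proposal are all assumptions made for this sketch.

```python
import random
from collections import Counter

def metropolis_hastings(q, sample_g, g, x0, n, rng):
    """Yield x_1, ..., x_n.  g(a, b) is the proposal probability g(a | b)."""
    x = x0
    for _ in range(n):
        x_new = sample_g(x, rng)                       # sample x' from g(. | x_k)
        ratio = (q(x_new) * g(x, x_new)) / (q(x) * g(x_new, x))
        if rng.random() < min(1.0, ratio):
            x = x_new                                  # accept the proposal
        yield x                                        # on rejection x_{k+1} = x_k

weights = [1.0, 2.0, 3.0, 4.0]                         # unnormalized target q on {0, 1, 2, 3}
chain = metropolis_hastings(
    q=lambda x: weights[x],
    sample_g=lambda x, rng: rng.randrange(4),          # uniform (symmetric) proposal
    g=lambda a, b: 0.25,
    x0=0, n=100_000, rng=random.Random(0))
counts = Counter(chain)
print([counts[x] / 100_000 for x in range(4)])         # close to [0.1, 0.2, 0.3, 0.4]
```

Note that $q$ only needs to be known up to a constant factor, since only ratios of $q$ values appear in the acceptance test.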


Exercises.


(1) Carry out the Buffon needle experiment with pseudo-random numbers with
£=s5s. Use $N = 10^m$, $m = 2, 3, 4, 5$, “tosses” of the needle to estimate $\pi$. Plot
the error against $N$ using a log-log plot.

(2) Rather than simulate the entire quicksort algorithm [59, Ch. 8] in this Exercise, we
generate the number of comparisons needed to perform quicksort given random
input. For each input list of length $n$ we choose a pivot $p$, which is randomly
chosen uniformly from $\{1, 2, \ldots, n\}$. The number of comparisons needed to split
the input list is $n - 1$. The method then recursively calls itself with sublists of
lengths $p - 1$ and $n - p$. Pseudo-code for this function is given below ($g$ is a
random generator of positive integers):

    function qssim(n, g)
        if n = 0 or n = 1: return 0; end if
        p <- (next(g) (mod n)) + 1
        return qssim(p - 1, g) + qssim(n - p, g) + (n - 1)
    end

Implement this function and plot the results over multiple runs for $n = 10^m$,
$m = 1, 2, 3, 4, 5$. Perform 100 runs for each value of $n$. How wide is the spread
of the values returned? The average number of comparisons is usually given as
$O(n \log n)$. Do your empirical results confirm this theoretical result? How far
does the number of comparisons diverge from this average?

(3) A simple continuous time birth-death Markov chain model for a population has
the differential equations for $p_n(t) = \Pr[P(t) = n]$ given by

$$\frac{dp_n}{dt} = (n-1)\,\beta(n-1)\, p_{n-1} - n\,[\beta(n) + \delta(n)]\, p_n + (n+1)\,\delta(n+1)\, p_{n+1}, \qquad \text{for } n > 0,$$

$$\frac{dp_0}{dt} = \delta(1)\, p_1,$$

where $\beta(k)$ is the birth rate for population $k$ and $\delta(k)$ is the death rate for
population $k$. These rates are per individual in the population per unit time. Show
that the total probability is constant: $(d/dt) \sum_{n=0}^{\infty} p_n(t) = 0$. If $\beta(k) = \beta$ and
$\delta(k) = \delta$ for all $k$, show that the average population $\sum_{n=0}^{\infty} n\, p_n(t)$ grows or
decays exponentially as $P_0 \exp((\beta - \delta) t)$.


(4) The example of the previous Exercise has a peculiar property: once the population
reaches zero, it stays at zero. That is, $p_0(t)$ is an increasing function of $t$. This
makes the Markov chain an absorbing Markov chain: there is a subset $S_0$ of the
states $S$ where the probability of transferring from any state in $S_0$ to any state in
$S_1 := S \setminus S_0$ is zero. For a continuous Markov chain, the differential equation for
the probabilities has the form

$$\frac{d}{dt} \begin{bmatrix} p_0 \\ p_1 \end{bmatrix}
  = \begin{bmatrix} A_{00} & A_{01} \\ 0 & A_{11} \end{bmatrix} \begin{bmatrix} p_0 \\ p_1 \end{bmatrix}.$$

Show that $e^T p_0(t)$ never decreases ($(d/dt)(e^T p_0(t)) \ge 0$) and $e^T p_1(t)$ never
increases. Also show that there is an eigenvalue $\lambda_1 \ge 0$ and an eigenvector $p_1$ with
non-negative components where $-A_{11}\, p_1 = \lambda_1\, p_1$. [Hint: For the last part, $A_{11}$
is a matrix with negative diagonal entries and non-negative off-diagonal entries.
Writing $-A_{11} = D - N$, note that $d_{jj} \ge \sum_i n_{ij}$ so for any $\epsilon > 0$, $\epsilon I - A_{11}$ is
a diagonally dominant matrix with positive diagonal entries and non-positive
off-diagonal entries, so $(\epsilon I - A_{11})^{-1}$ exists and has only non-negative entries.
The dominant eigenvalue of this inverse gives approximations $\mu_\epsilon$ that converge
as $\epsilon \downarrow 0$.]

(5) Suppose that $G = (V, E)$ is a connected undirected graph with transitions
between vertices $x \to y$ having probability $p_{yx} = 1/\deg_G(x)$ at each step for
each $y$ directly connected to $x$ by an edge ($y \sim x$). Show that the equilibrium
probability distribution over the vertices of $G$ is given by
$\pi_x = \deg_G(x) / \sum_{z \in V} \deg_G(z) = \deg_G(x)/(2|E|)$.

(6) Considering the previous Exercise, what should be the transition probabilities
to ensure that the equilibrium probability distribution is uniform ($\pi_x = 1/|V|$)?
[Hint: Transitions $x \to x$ should be allowed and given a suitable transition
probability.]

(7) Simulate a continuous-time Markov chain model of an infection in a population:
suppose everyone is either susceptible (S), infected (I), or recovered (R). Let
$s$ be the number of susceptible individuals, $i$ the number of infected individuals,
and $r$ the number of recovered individuals. Assuming neither population
gain nor loss, $s + i + r = N$, the total population, so $r = N - s - i$. The
transitions are: (1) infected person recovers $(s, i, r) \mapsto (s, i - 1, r + 1)$; (2) susceptible
person becomes infected $(s, i, r) \mapsto (s - 1, i + 1, r)$; and (3) recovered person
becomes susceptible $(s, i, r) \mapsto (s + 1, i, r - 1)$. Transition (1) has a rate $\alpha\, i$;
transition (2) has a rate $\beta\, s\, i$; transition (3) has rate $\gamma\, r = \gamma (N - s - i)$. Use
$\alpha = 10^{-1}$, $\beta = 10^{-2}$, and $\gamma = 10^{-3}$, with $N = 100$. Initially, assume all but five
individuals are susceptible, and five are infected. Note that this is an absorbing
Markov chain as once the number of infected people becomes zero, no more
infections can occur. The time step needed for numerically solving this Markov
chain may be quite small, especially for an explicit method: $\Delta t \approx 10^{-2}$ perhaps.
Try repeating this example with $N = 1000$.

(8) △ The Google PageRank algorithm is based on the idea of a Markov chain
arising from the network of web pages. Each state of the PageRank Markov
chain is a web page. Every web page is linked to zero or more other web pages.
Provided there is at least one outgoing link on a given web page, the transitions
from this web page are to all web pages linked from the current web page. Each
link is equally likely to be chosen; this gives the transition probabilities. If a
given web page has no outgoing links, then the transitions from this web page
are to every web page, with each web page having equal probability. These rules
define a transition matrix $P$, which is a sparse matrix plus a rank-1 matrix (for
the web pages with no outgoing links). The actual transition matrix used in the
PageRank algorithm is $P_\alpha = (1 - \alpha) P + \alpha\, e e^T / N$ where $e = [1, 1, 1, \ldots, 1]^T$
and $N$ is the total number of web pages. Show that $P_\alpha$ is a transition matrix for a
Markov chain provided $0 < \alpha < 1$. The actual PageRank algorithm is to obtain
the equilibrium probability distribution $\pi$ for the Markov chain represented by $P_\alpha$
(usually with $\alpha \approx 0.15$) and to rank order any set of web pages found in a search
in decreasing order of $\pi_k$ for web page $k$. The equilibrium distribution is found
by means of the power method: $\pi^{(t+1)} = P_\alpha \pi^{(t)}$, $t = 0, 1, 2, \ldots$. Implement
the PageRank algorithm and apply it to a network of your own devising or to a
network obtained by scraping the World Wide Web.

Stochastic differential equations combine randomness with differential equations,
and if they are autonomous (in the sense that time does not appear explicitly in the
equation) then these form a continuous time and continuous state Markov chain. A
stochastic differential equation, in general, has the form

(7.5.1)    $dX_t = f(t, X_t)\, dt + \sigma(t, X_t)\, dW_t$.

Here $W_t$ represents a Brownian motion, named after Robert Brown who in 1827
observed pollen particles¹ under a microscope, moving in water in a random and
jittery way without apparent cause. The cause was in fact the collisions of water
molecules with the pollen particles. The frequent collisions with water molecules
push the pollen particles along a random walk with frequent small steps.

¹ Brown was actually observing organelles of the pollen moving, not the pollen grains themselves
moving.

Stochastic differential equations are the subject of Øksendal’s book [191]; numerical
methods for stochastic differential equations are the subject of the book by
Kloeden and Platen [144].


7.5.1 Wiener Processes 


Our understanding of Brownian motion was developed by Einstein, Smoluchowski,
and Norbert Wiener. It was Wiener who put it all on a rigorous foundation. For this
reason we often refer to Wiener processes. The first property of a Wiener process is
that it is an independent increments process: that is, for $a < b < c$ the random
variables $W_c - W_b$ and $W_b - W_a$ are independent. The second property is translation
invariance: $W_b - W_a$ has the same distribution as $W_{b-a} - W_0$. The third property is
that $W_b - W_a$ has finite variance. Because of the independent increments property
and finite variance,

$$\text{Var}[W_c - W_a] = \text{Var}[(W_c - W_b) + (W_b - W_a)]
  = \text{Var}[W_c - W_b] + \text{Var}[W_b - W_a].$$

Combined with translation invariance, we see that $\text{Var}[W_c - W_a]$ must be
proportional to $c - a$. The final property is that $\text{Var}[W_1 - W_0] = I$. Then
$\text{Var}[W_c - W_a] = (c - a) I$. It is a standard Wiener process if, in addition, $W_0 = 0$ with
probability one.

The Central Limit Theorem (Theorem 7.4) implies that each $W_t$ and difference
$W_t - W_s$ must be normally distributed.

We can create approximate Wiener processes by selecting a time-step $h > 0$, and
then take $W_{h,t}$ to be the linear interpolant over $t$ of $W_{h,kh}$, $k = 0, 1, 2, \ldots$ with
$W_{h,0} = 0$ and $W_{h,(k+1)h} = W_{h,kh} + \sqrt{h}\, Z_k^{(h)}$ where $Z_k^{(h)} \sim \text{Normal}(0, I)$ is an
independent increment. The trouble with this approach is that we cannot take meaningful
limits as $h \downarrow 0$: unless we have some relationship between the $Z_k^{(h)}$ for different
values of $h$, for example, we cannot expect the $W_{h,t}$ to converge as $h \downarrow 0$.

To see how to create a convergent sequence $W_{h,t}$, we focus on the scalar case.
The vector case can be dealt with by treating each component independently.

Start with the values $W_{1,0} = 0$ and $W_{1,1} \sim \text{Normal}(0, 1)$ and “fill in” the values
between $t = 0$ and $t = 1$. For $h = 1/2$ we set $W_{1/2,0} = W_{1,0}$ and $W_{1/2,1} = W_{1,1}$.
However, we need to determine a value $W_{1/2,1/2}$ so that $W_{1/2,1/2} - W_{1/2,0}$ and
$W_{1/2,1} - W_{1/2,1/2}$ are independent and are both distributed as Normal(0, 1/2). We
generate an independent normally distributed sample $U_1^{(1/2)} \sim \text{Normal}(0, 1/4)$ and
set $W_{1/2,1/2} = \tfrac12 (W_{1,0} + W_{1,1}) + U_1^{(1/2)}$. We chose $U_1^{(1/2)}$ to have variance 1/4 so
that the variances add properly:

$$\tfrac12 = \text{Var}[W_{1/2,1/2}] = \text{Var}\left[\tfrac12 (W_{1,0} + W_{1,1})\right] + \text{Var}\left[U_1^{(1/2)}\right]
  = \tfrac14 + \text{Var}\left[U_1^{(1/2)}\right].$$

Since $(W_{1/2,0}, W_{1/2,1/2}, W_{1/2,1})$ are jointly normally distributed, to show independence
of the increments it suffices to show that they have zero correlation. Since
all means are zero, this reduces to showing that
$E[(W_{1/2,1/2} - W_{1/2,0})(W_{1/2,1} - W_{1/2,1/2})] = 0$:

$$E\left[(W_{1/2,1/2} - W_{1/2,0})(W_{1/2,1} - W_{1/2,1/2})\right]
  = E\left[\left(\tfrac12 (W_{1,1} - W_{1,0}) + U_1^{(1/2)}\right)\left(\tfrac12 (W_{1,1} - W_{1,0}) - U_1^{(1/2)}\right)\right]$$
$$= \tfrac14 E\left[(W_{1,1} - W_{1,0})^2\right] - E\left[(U_1^{(1/2)})^2\right]
  = \tfrac14 - \text{Var}\left[U_1^{(1/2)}\right] = 0.$$

Also note that the variances $\text{Var}[W_{1/2,1/2} - W_{1/2,0}] = \text{Var}[W_{1/2,1} - W_{1/2,1/2}] = 1/2$.

In general, suppose that we have constructed $W_{H,Hk}$ for $k = 0, 1, 2, \ldots, 2^m$ and
$H = 2^{-m}$. Now for $h = 2^{-m-1}$ we wish to construct $W_{h,h\ell}$ for $\ell = 0, 1, 2, \ldots, 2^{m+1}$.
For even $\ell = 2j$ we set $W_{h,h2j} = W_{H,Hj}$. For odd $\ell = 2j + 1$ we set
$W_{h,h(2j+1)} = \tfrac12 (W_{H,Hj} + W_{H,H(j+1)}) + U_{2j+1}^{(h)}$ where $U_{2j+1}^{(h)}$ is an independent sample
of the Normal(0, $\tfrac12 h$) distribution. The arguments used above for the case $H = 1$ and
$h = \tfrac12$ can be applied here: the random variables $(W_{h,h2j}, W_{h,h(2j+1)}, W_{h,h(2j+2)})$ are
jointly normally distributed, and
$E[(W_{h,h(2j+1)} - W_{h,h2j})(W_{h,h(2j+2)} - W_{h,h(2j+1)})] = 0$.
This shows that the increments $W_{h,h(2j+1)} - W_{h,h2j}$ and $W_{h,h(2j+2)} - W_{h,h(2j+1)}$ are
independent. Standard calculations show that
$\text{Var}[W_{h,h(2j+1)} - W_{h,h2j}] = \text{Var}[W_{h,h(2j+2)} - W_{h,h(2j+1)}] = h$.

Assuming that for integers $0 \le p < q < r \le 2^m$ the increments $W_{H,Hq} - W_{H,Hp}$
and $W_{H,Hr} - W_{H,Hq}$ are independent with variances $H \cdot (q - p)$ and $H \cdot (r - q)$,
respectively, then we can show the corresponding result for $W_{h,h\ell}$, $\ell = 0, 1, 2, \ldots,
2^{m+1}$ using the independence and variance results of the previous paragraph.

We now have a sequence of functions $t \mapsto W_{h,t}$ for $h = 2^{-m}$, $m = 0, 1, 2, \ldots$,
each of which has the property that for $h = 2^{-m}$ and $0 \le p < q < r$ the increments
$W_{h,hq} - W_{h,hp}$ and $W_{h,hr} - W_{h,hq}$ are independent with variances $h(q - p)$
and $h(r - q)$, respectively. Furthermore, if $H = 2^{-s}$ for some positive integer $s$,
then for $h = 2^{-m}$ with $m \ge s$ we have $W_{H,Hj} = W_{h,Hj}$, because $Hj = h\, 2^{m-s} j$ is a grid
point for step size $h$ and the construction never changes values at existing grid points.
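The refinement step lends itself to a short implementation. The sketch below handles the scalar case on $[0, 1]$ and stores each level as a plain list of grid values; `refine` and `brownian_path` are illustrative names, and `random.gauss` takes a standard deviation, hence the square root of the variance $h/2$.

```python
import math
import random

def refine(path, h, rng):
    """Given W on a grid of spacing 2h, return W on the grid of spacing h."""
    fine = []
    for left, right in zip(path[:-1], path[1:]):
        # midpoint value: average of the neighbours plus Normal(0, h/2) noise
        mid = 0.5 * (left + right) + rng.gauss(0.0, math.sqrt(h / 2.0))
        fine.extend([left, mid])
    fine.append(path[-1])
    return fine

def brownian_path(levels, rng):
    """Return (values of W at 0, h, 2h, ..., 1, and the spacing h = 2**-levels)."""
    path = [0.0, rng.gauss(0.0, 1.0)]       # W_{1,0} = 0 and W_{1,1} ~ Normal(0, 1)
    h = 1.0
    for _ in range(levels):
        h /= 2.0
        path = refine(path, h, rng)
    return path, h

path, h = brownian_path(10, random.Random(0))
print(len(path), h, path[-1])
```

Values at existing grid points are copied unchanged at every level, which is what makes the sequence of interpolants converge.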

It can also be shown that the limit $t \mapsto W_t$ is a continuous function; in fact, with
probability one, $W_t$ is Hölder continuous with exponent $\alpha$ for any $0 < \alpha < 1/2$. In
fact, we can do a little better. Lévy [161] was able to show that for any $T > 0$, with
probability one,

(7.5.2)    $$\limsup_{\epsilon \downarrow 0}\; \sup_{|s - t| < \epsilon,\; s, t \in [0, T]} \frac{|W_t - W_s|}{\sqrt{2 \epsilon \ln(1/\epsilon)}} = 1.$$

A modern proof of this result can be found in [25, Thm. 10.2, p. 70]. However,
with probability one $W_t$ is not differentiable anywhere, nor does it have bounded
variation². This means we cannot interpret integrals $\int_a^b g(t)\, dW_t$ as Lebesgue–Stieltjes
integrals. This has an impact on how we can interpret and numerically
solve stochastic differential equations.


7.5.2 Itô Stochastic Differential Equations


Interpreting an ordinary differential equation (ODE)

$$\frac{dx}{dt} = f(t, x(t)), \qquad x(t_0) = x_0$$

can be done in terms of integrals:

$$x(t) = x_0 + \int_{t_0}^{t} f(s, x(s))\, ds \qquad \text{for all } t \ge t_0.$$

Interpreting the stochastic differential equation (SDE)

$$dX_t = f(t, X_t)\, dt + \sigma(t, X_t)\, dW_t$$

in the sense of Itô in terms of integrals

$$X_t = X_{t_0} + \int_{t_0}^{t} \left[ f(s, X_s)\, ds + \sigma(s, X_s)\, dW_s \right]$$

is a more difficult task as we need to interpret the integral $\int_{t_0}^{t} \sigma(s, X_s)\, dW_s$ which
involves products of random quantities. A summary of Itô calculus is [191].

A specific example we can consider is $\int_0^t W_s\, dW_s$. If we used standard methods
from calculus, a change of variable $u(s) = \tfrac12 W_s^2$ would give $du = W_s\, dW_s$ and so
$\int_0^t W_s\, dW_s = \tfrac12 W_s^2 \big|_{s=0}^{s=t} = \tfrac12 W_t^2$. Note that its expectation is $\tfrac12 t$. On the other hand,
if we approximate the integral by the sum

$$\sum_{k=0}^{n-1} W_{hk}\, (W_{h(k+1)} - W_{hk})$$

where $t = nh$ we get a random quantity whose expectation is zero: $W_{h(k+1)} - W_{hk}$
is independent of $W_{hk} = W_{hk} - W_0$ by the independent increments property,
so $E[W_{hk}\, (W_{h(k+1)} - W_{hk})] = E[W_{hk}]\, E[W_{h(k+1)} - W_{hk}] = 0$. Taking the limit as
$h \to 0$ gives $E[\int_0^t W_s\, dW_s] = 0$, not $\tfrac12 t$.
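The left-endpoint sum can be checked numerically. Summing $W_{h(k+1)}^2 - W_{hk}^2 = 2 W_{hk} \Delta W_k + \Delta W_k^2$ over $k$ shows that the sum equals $\tfrac12 W_t^2 - \tfrac12 \sum_k \Delta W_k^2$ exactly, where $\Delta W_k = W_{h(k+1)} - W_{hk}$; since $\sum_k \Delta W_k^2$ concentrates near $t$ for small $h$, the sum is close to $\tfrac12 W_t^2 - \tfrac12 t$ rather than $\tfrac12 W_t^2$. The sketch below, with illustrative names and parameters, confirms both statements for one sample path.

```python
import math
import random

def ito_sum(increments):
    """Left-endpoint sum of W_{hk} (W_{h(k+1)} - W_{hk}) with W_0 = 0; also returns W_t."""
    w, total = 0.0, 0.0
    for dw in increments:
        total += w * dw        # integrand evaluated at the left end of each step
        w += dw
    return total, w

rng = random.Random(0)
t, n = 1.0, 100_000
h = t / n
dws = [rng.gauss(0.0, math.sqrt(h)) for _ in range(n)]   # independent Normal(0, h) increments
total, w_t = ito_sum(dws)
print(total, 0.5 * w_t ** 2 - 0.5 * t, 0.5 * w_t ** 2)
```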


² Having bounded variation $V$ over $[a, b]$ would mean that $\sum_{i=0}^{n-1} |W_{t_{i+1}} - W_{t_i}| \le V$ for any
sequence $a \le t_0 < t_1 < \cdots < t_{n-1} < t_n \le b$ with any integer $n \ge 1$.

This difference can be explained in terms of the Itô formula [191, Lemma 4.2.1]:


Theorem 7.7 If $dX_t = u(t)\, dt + \sigma(t)\, dW_t$ in the Itô sense and $Y_t = g(t, X_t)$
where $u$ and $\sigma$ are continuous and $g$ has continuous second derivatives, then

(7.5.3)
$$dY_t = \frac{\partial g}{\partial t}(t, X_t)\, dt + \nabla g(t, X_t)^T dX_t + \tfrac12\, dX_t^T\, \text{Hess}\, g(t, X_t)\, dX_t$$
$$= \left[ \frac{\partial g}{\partial t}(t, X_t) + \nabla g(t, X_t)^T u(t)
   + \tfrac12\, \text{trace}\left( \sigma(t)^T\, \text{Hess}\, g(t, X_t)\, \sigma(t) \right) \right] dt
   + \nabla g(t, X_t)^T \sigma(t)\, dW_t.$$


The Itô formula applied to $\tfrac12 W_t^2$ yields $d(\tfrac12 W_t^2) = W_t\, dW_t + \tfrac12\, dt$; integrating
then gives $\int_0^t W_s\, dW_s = \tfrac12 W_t^2 - \tfrac12 t$ which has expectation zero.
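Written out, this application of Theorem 7.7 takes the scalar case $X_t = W_t$ (so $u = 0$ and $\sigma = 1$) with $g(t, x) = \tfrac12 x^2$ and assumes $W_0 = 0$:

```latex
\begin{align*}
\frac{\partial g}{\partial t} &= 0, \qquad \nabla g(t,x) = x, \qquad \operatorname{Hess} g(t,x) = 1, \\
d\bigl(\tfrac12 W_t^2\bigr)
  &= \Bigl[\, 0 + W_t \cdot 0 + \tfrac12 \cdot 1 \cdot 1 \cdot 1 \,\Bigr]\, dt + W_t \cdot 1 \, dW_t
   = \tfrac12\, dt + W_t\, dW_t, \\
\tfrac12 W_t^2 - \tfrac12 W_0^2 &= \tfrac12\, t + \int_0^t W_s\, dW_s
  \quad\Longrightarrow\quad \int_0^t W_s\, dW_s = \tfrac12 W_t^2 - \tfrac12\, t .
\end{align*}
```

Taking expectations and using $E[W_t^2] = t$ recovers the value zero.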
The integral $\int_0^t \sigma(s, X_s)\, dW_s$ in the Itô sense is

(7.5.4)    $$\lim_{n \to \infty} \sum_{k=0}^{n-1} \sigma(hk, X_{hk})\, [W_{h(k+1)} - W_{hk}] \qquad \text{where } h = t/n.$$

Solutions exist and are unique provided both $f(t, x)$ and $\sigma(t, x)$ are Lipschitz in
$x$, uniformly in $t$, given the initial value $X_{t_0} = x_0$ [191, Thm. 5.2.1], which can be
shown via Picard iteration and the Itô isometry [191, Lemma 3.1.5]: if $g(t)$ is a
random variable that is continuous in $t$,

(7.5.5)    $$E\left[ \left( \int_a^b g(s)\, dW_s \right)^2 \right] = E\left[ \int_a^b g(s)^2\, ds \right].$$

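Given the underlying Wiener process $W_t$, we have existence and uniqueness of solutions of $dX_t = f(t, X_t)\,dt + \sigma(t, X_t)\,dW_t$ with $X_{t_0} = x_0$. This is what we need for simulating a stochastic differential equation. However, for determining the probability distribution of $X_t$, we need a different equation. If $p(t, x)$ is the probability density function of $X_t$, we can obtain a differential equation for $p(t, x)$. We can do that by using the Itô formula (7.5.3): for any smooth $g(x)$, if $dX_t = f(t, X_t)\,dt + \sigma(t, X_t)\,dW_t$ then

$$\begin{aligned}
\frac{d}{dt} \mathbb{E}[g(X_t)] &= \mathbb{E}\left[ \nabla g(X_t)^T f(t, X_t) + \frac12\, \mathrm{trace}\!\left( \sigma(t, X_t)^T\, \mathrm{Hess}\, g(X_t)\, \sigma(t, X_t) \right) \right] \\
&= \int_{\mathbb{R}^n} \left[ \nabla g(x)^T f(t, x) + \frac12\, \mathrm{trace}\!\left( \sigma(t, x)^T\, \mathrm{Hess}\, g(x)\, \sigma(t, x) \right) \right] p(t, x)\,dx.
\end{aligned}$$

Integrating by parts, assuming that $p(t, x) \to 0$ rapidly as $\|x\| \to \infty$, gives

$$\frac{d}{dt} \mathbb{E}[g(X_t)] = \int_{\mathbb{R}^n} \left[ -\mathrm{div}\left( p(t, x) f(t, x) \right) + \frac12 \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j} \left( p(t, x)\, \sigma(t, x) \sigma(t, x)^T \right)_{ij} \right] g(x)\,dx.$$

On the other hand,

$$\frac{d}{dt} \mathbb{E}[g(X_t)] = \frac{d}{dt} \int_{\mathbb{R}^n} p(t, x)\, g(x)\,dx = \int_{\mathbb{R}^n} \frac{\partial p}{\partial t}(t, x)\, g(x)\,dx.$$

Equating the two gives

$$\int_{\mathbb{R}^n} \frac{\partial p}{\partial t}(t, x)\, g(x)\,dx = \int_{\mathbb{R}^n} \left[ -\mathrm{div}\left( p(t, x) f(t, x) \right) + \frac12 \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j} \left( p(t, x)\, \sigma(t, x) \sigma(t, x)^T \right)_{ij} \right] g(x)\,dx$$

for all smooth $g$. Then we can conclude that

$$\frac{\partial p}{\partial t}(t, x) = -\mathrm{div}\left( p(t, x) f(t, x) \right) + \frac12 \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j} \left( p(t, x)\, \sigma(t, x) \sigma(t, x)^T \right)_{ij}, \tag{7.5.6}$$

which is the Fokker–Planck equation for the stochastic differential equation.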
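As a quick sanity check of (7.5.6), consider the simplest scalar case. This worked example is an illustration added here, not taken from the source: take $f = 0$ and $\sigma = 1$, so that $X_t = W_t$ with $X_0 = 0$.

```latex
% Fokker--Planck equation (7.5.6) for dX_t = dW_t (f = 0, sigma = 1, n = 1):
\frac{\partial p}{\partial t}(t, x) = \frac{1}{2}\,\frac{\partial^2 p}{\partial x^2}(t, x).
% This is the heat equation. Its solution starting from a point mass at x = 0 is
p(t, x) = \frac{1}{\sqrt{2\pi t}}\, \exp\!\left( -\frac{x^2}{2t} \right),
% which is the Normal(0, t) density, as expected for W_t.
```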


7.5.3 Stratonovich Integrals and Differential Equations 


The Itô integral is based on approximations by finite sums of the form

$$\int_a^b f(X_t)\,dY_t \approx \sum_{k=0}^{n-1} f(X_{a+kh})\,\left( Y_{a+(k+1)h} - Y_{a+kh} \right)$$

with $h = (b - a)/n$. As was noted in Section 7.5.2, the stochastic integral $\int_0^t W_s\,dW_s$ has expectation zero when interpreted in the sense of Itô. There is another interpretation called the Stratonovich integral, which is based on the trapezoidal rule:

$$\int_a^b f(X_t) \circ dY_t \approx \sum_{k=0}^{n-1} \frac12 \left[ f(X_{a+kh}) + f(X_{a+(k+1)h}) \right] \left( Y_{a+(k+1)h} - Y_{a+kh} \right). \tag{7.5.7}$$

Because of the nature of Wiener processes, the limits as $n \to \infty$ are different for integrals like

$$\int_0^t W_s \circ dW_s \approx \frac12 \sum_{k=0}^{n-1} \left( W_{h(k+1)} + W_{hk} \right) \left( W_{h(k+1)} - W_{hk} \right) = \frac12 \sum_{k=0}^{n-1} \left( W_{h(k+1)}^2 - W_{hk}^2 \right) = \frac12 \left( W_t^2 - W_0^2 \right),$$

matching the naive application of standard calculus rules.
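The telescoping in the trapezoidal sum can be seen numerically on a single sampled path. This is a minimal sketch under assumptions made here (function name, number of steps, and seed are illustrative), using only the standard library.

```python
import random

def stochastic_sums(t=1.0, n=1000, seed=2024):
    """Left-endpoint (Ito) and trapezoidal (Stratonovich) sums for W dW on one path."""
    rng = random.Random(seed)
    h = t / n
    w = 0.0
    ito_sum = 0.0
    strat_sum = 0.0
    for _ in range(n):
        dw = rng.gauss(0.0, h ** 0.5)
        ito_sum += w * dw                        # left endpoint (Ito)
        strat_sum += 0.5 * (w + (w + dw)) * dw   # trapezoidal rule (Stratonovich)
        w += dw
    return ito_sum, strat_sum, w

ito_sum, strat_sum, w_t = stochastic_sums()
print(strat_sum - 0.5 * w_t**2)   # zero up to roundoff: the sum telescopes
print(strat_sum - ito_sum)        # half the sum of dw**2, close to t/2
```

The gap between the two sums is $\tfrac12 \sum_k (W_{h(k+1)} - W_{hk})^2$, which concentrates near $\tfrac12 t$ as $n$ grows.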
Stratonovich stochastic differential equations have the form

$$dX_t = f(X_t)\,dt + \sigma(X_t) \circ dW_t, \quad X_0 = x_0. \tag{7.5.8}$$

If integrals are interpreted in the Stratonovich sense,

$$X_t = X_0 + \int_0^t f(X_s)\,ds + \int_0^t \sigma(X_s) \circ dW_s,$$

then the solutions are solutions in the sense of Stratonovich.
For scalar stochastic differential equations, the Stratonovich equation

$$dX_t = f(X_t)\,dt + \sigma(X_t) \circ dW_t$$

is equivalent to the Itô equation

$$dX_t = \left( f(X_t) + \tfrac12\, \sigma(X_t)\, \sigma'(X_t) \right) dt + \sigma(X_t)\,dW_t.$$
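For the vector Stratonovich equation

$$dX_t = f(X_t)\,dt + \sigma(X_t) \circ dW_t, \quad X_0 = x_0,$$

the equivalent Itô equation is

$$dX_t = \left( f(X_t) + \tfrac12\, c(X_t) \right) dt + \sigma(X_t)\,dW_t \quad \text{where} \quad c_i(x) = \sum_{k=1}^{m} \sum_{j=1}^{n} \sigma_{jk}(x)\, \frac{\partial \sigma_{ik}}{\partial x_j}(x),$$

with $\sigma(x)$ an $n \times m$ matrix. Itô equations can also be represented as Stratonovich equations by replacing $f(X_t)$ with $f(X_t) - \tfrac12 c(X_t)$ and replacing $\sigma(X_t)\,dW_t$ by $\sigma(X_t) \circ dW_t$.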
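A short worked example of the scalar conversion may help. It is an illustration added here, with $s$ a constant chosen for the example rather than a quantity from the source.

```latex
% Stratonovich equation with sigma(x) = s x and f = 0:
dX_t = s X_t \circ dW_t .
% Here sigma'(x) = s, so (1/2) sigma(x) sigma'(x) = (1/2) s^2 x and the equivalent Ito equation is
dX_t = \tfrac{1}{2} s^2 X_t \, dt + s X_t \, dW_t .
% Ordinary calculus rules apply to the Stratonovich form, giving
X_t = X_0 \exp( s W_t ).
```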
7.5.4 Euler–Maruyama Method


The Euler–Maruyama method is essentially the Euler method applied to the stochastic differential equation

$$dX_t = f(t, X_t)\,dt + \sigma(t, X_t)\,dW_t, \quad X_0 = x_0.$$

For a step size $h > 0$, the method consists of the iteration

$$\widehat{X}^h_{h(k+1)} = \widehat{X}^h_{hk} + h\, f(hk, \widehat{X}^h_{hk}) + \sigma(hk, \widehat{X}^h_{hk})\, \left( W_{h(k+1)} - W_{hk} \right) \tag{7.5.9}$$

where the numerical solution is $\widehat{X}^h_{hk}$, $k = 0, 1, 2, \ldots$, with initial value $\widehat{X}^h_0 = x_0$.
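The iteration (7.5.9) translates almost line for line into code. The following is a minimal scalar sketch, not the author's implementation: the function name, the test equation $dX_t = rX_t\,dt + sX_t\,dW_t$, the step size, and the seed are assumptions chosen for illustration, as is the use of the closed form $X_t = X_0 \exp((r - \tfrac12 s^2)t + sW_t)$ for comparison.

```python
import math
import random

def euler_maruyama(f, sigma, x0, h, n_steps, rng):
    """Scalar Euler-Maruyama iteration (7.5.9); returns the X values and the W values."""
    xs, ws = [x0], [0.0]
    for k in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(h))   # W_{h(k+1)} - W_{hk} ~ Normal(0, h)
        t = h * k
        xs.append(xs[-1] + h * f(t, xs[-1]) + sigma(t, xs[-1]) * dw)
        ws.append(ws[-1] + dw)
    return xs, ws

# Illustrative test problem: dX = r X dt + s X dW with X_0 = 1 on [0, 1]
r, s = 1.0, 1.0
xs, ws = euler_maruyama(lambda t, x: r * x, lambda t, x: s * x,
                        1.0, 2.0**-10, 2**10, random.Random(0))
exact_end = math.exp((r - 0.5 * s * s) * 1.0 + s * ws[-1])  # closed form at t = 1
print(xs[-1], exact_end)
```

Returning the sampled Wiener values alongside the solution makes it possible to compare against a reference solution driven by the same path.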


An example of the result of the Euler–Maruyama method is shown in Figure 7.5.1. This shows solutions for

$$dX_t = r X_t\,dt + s X_t\,dW_t, \quad X_0 = 1,$$

with $r = s = 1$. This is a model for price evolution with a natural interest or inflation rate of $r$ and volatility $s$. The trajectories in Figure 7.5.1(a) use the same underlying Wiener process. The details, in case you want to reconstruct this solution, are as follows: the approximate Wiener process was created using Matlab's randn function based on the Mersenne Twister generator with seed 95324965 to generate $2^{20}$ pseudo-random numbers distributed according to the Normal$(0, 2^{-20})$ distribution; then cumulative sums were used to give the values of $W_{hk}$ with $h = 2^{-20}$ and $k = 0, 1, 2, \ldots, 2^{20}$.

The convergence of the trajectories, rather than just the statistical properties of the trajectories, illustrates strong convergence of the approximate trajectories for this method. That is, there are positive constants $C$ and $h_0$ where

$$\max_{0 \le hk \le T} \mathbb{E}\left[ \left\| \widehat{X}^h_{hk} - X_{hk} \right\| \right] \le C\, h^{\alpha} \quad \text{for all } 0 < h < h_0. \tag{7.5.10}$$

Weak convergence means that for any smooth function $g(x)$,

$$\max_{0 \le hk \le T} \left| \mathbb{E}\left[ g(\widehat{X}^h_{hk}) - g(X_{hk}) \right] \right| = O(h^{\alpha}) \quad \text{as } h \downarrow 0. \tag{7.5.11}$$

Figure 7.5.1(b) shows the corresponding maximum errors. Note that the dashed line in Figure 7.5.1(b) is the plot of $\sqrt{h}$.

Convergence theory for the Euler–Maruyama method is developed in, for example, Kloeden and Platen [144, Thm. 10.2.2]. The error is on average $O(\sqrt{h})$, and $O(\sqrt{h}\,\ln h)$ with probability one.
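The $O(\sqrt{h})$ behavior can be observed in a small experiment. This is a rough sketch under assumptions made here: the test equation $dX_t = X_t\,dt + X_t\,dW_t$ with its closed-form solution, the number of paths, the step sizes, and the seed are all illustrative choices, and the averages are only Monte Carlo estimates.

```python
import math
import random

def em_max_error(r, s, h_fine, n_fine, stride, dws):
    """Max error along one path for Euler-Maruyama with step size stride * h_fine."""
    h = stride * h_fine
    x, w, err = 1.0, 0.0, 0.0
    for k in range(0, n_fine, stride):
        dw = sum(dws[k:k + stride])          # coarse increment from the same fine path
        x += h * r * x + s * x * dw
        w += dw
        t = h_fine * (k + stride)
        exact = math.exp((r - 0.5 * s * s) * t + s * w)
        err = max(err, abs(x - exact))
    return err

def mean_errors(n_paths=200, n_fine=2**10, strides=(1, 4, 16, 64), seed=7):
    """Average the maximum error over many paths for each step size."""
    rng = random.Random(seed)
    h_fine = 1.0 / n_fine
    totals = [0.0] * len(strides)
    for _ in range(n_paths):
        dws = [rng.gauss(0.0, math.sqrt(h_fine)) for _ in range(n_fine)]
        for i, stride in enumerate(strides):
            totals[i] += em_max_error(1.0, 1.0, h_fine, n_fine, stride, dws)
    return [(stride * h_fine, total / n_paths) for stride, total in zip(strides, totals)]

errors = mean_errors()
for h, e in errors:
    print(f"h = {h:.5f}   mean max error = {e:.4f}   error / sqrt(h) = {e / math.sqrt(h):.3f}")
```

If the strong order is one half, the last column should stay roughly constant as $h$ varies.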
Fig. 7.5.1 Results of the Euler–Maruyama method: (a) trajectories; (b) error behavior


7.5.5 Higher Order Methods for Stochastic Differential 
Equations 


Creating higher order methods for stochastic differential equations is harder than creating higher order methods for ordinary differential equations, or even partial differential equations. Part of the problem is that multiple integrals of Wiener processes arise, which require special treatment [143]. However, higher order methods can be created using Taylor series and Runge–Kutta approaches [42, 144]. The Taylor series approach involves further random quantities obtained from a given Wiener process.

The Milstein method [143, (4.2) on p. 291] for a scalar SDE in the Itô sense is

$$\widehat{X}^h_{h(k+1)} = \widehat{X}^h_{hk} + f(\widehat{X}^h_{hk})\,h + \sigma(\widehat{X}^h_{hk})\,(W_{h(k+1)} - W_{hk}) + \tfrac12\, \sigma'(\widehat{X}^h_{hk})\, \sigma(\widehat{X}^h_{hk}) \left[ (W_{h(k+1)} - W_{hk})^2 - h \right]. \tag{7.5.12}$$

This method is strongly convergent with order 1, so that the expected global error is $O(h)$. There is also an implicit version of the Milstein method

$$\widehat{X}^h_{h(k+1)} = \widehat{X}^h_{hk} + f(\widehat{X}^h_{h(k+1)})\,h + \sigma(\widehat{X}^h_{hk})\,(W_{h(k+1)} - W_{hk}) + \tfrac12\, \sigma'(\widehat{X}^h_{hk})\, \sigma(\widehat{X}^h_{hk}) \left[ (W_{h(k+1)} - W_{hk})^2 - h \right], \tag{7.5.13}$$

which is also strongly convergent with order 1. Higher order Taylor series methods involve higher order derivatives of $f$ and $\sigma$.
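The effect of the extra Milstein term in (7.5.12) can be seen by running both methods on the same paths. This is a minimal sketch under assumptions made here: the scalar test equation $dX_t = rX_t\,dt + sX_t\,dW_t$ (so $\sigma(x) = sx$ and $\sigma'(x) = s$), its closed-form solution, and the step size, path count, and seed are illustrative.

```python
import math
import random

def em_and_milstein_errors(h, n_steps, r=1.0, s=1.0, seed=1, n_paths=300):
    """Mean endpoint errors of Euler-Maruyama and Milstein on identical Wiener paths."""
    rng = random.Random(seed)
    em_total = mil_total = 0.0
    for _ in range(n_paths):
        x_em = x_mil = 1.0
        w = 0.0
        for _ in range(n_steps):
            dw = rng.gauss(0.0, math.sqrt(h))
            x_em += r * x_em * h + s * x_em * dw
            # Milstein correction: 0.5 * sigma'(x) * sigma(x) * (dw**2 - h)
            x_mil += r * x_mil * h + s * x_mil * dw + 0.5 * s * (s * x_mil) * (dw * dw - h)
            w += dw
        exact = math.exp((r - 0.5 * s * s) * n_steps * h + s * w)
        em_total += abs(x_em - exact)
        mil_total += abs(x_mil - exact)
    return em_total / n_paths, mil_total / n_paths

em_err, mil_err = em_and_milstein_errors(h=2.0**-8, n_steps=2**8)
print(f"Euler-Maruyama mean error: {em_err:.4f}")
print(f"Milstein mean error:       {mil_err:.4f}")
```

For this test problem the Milstein error should be noticeably smaller at the same step size, consistent with strong order 1 versus one half.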

Runge–Kutta methods for stochastic differential equations are given in Burrage and Burrage [40]. Burrage and Burrage first mention explicit $s$-stage stochastic Runge–Kutta methods of the form

$$Y_i = \widehat{X}^h_{hk} + h \sum_{j=1}^{i-1} a_{ij}\, f(Y_j) + \sum_{j=1}^{i-1} b_{ij}\, \sigma(Y_j)\, J_1, \quad i = 1, 2, \ldots, s, \tag{7.5.14}$$

$$\widehat{X}^h_{h(k+1)} = \widehat{X}^h_{hk} + h \sum_{j=1}^{s} \alpha_j\, f(Y_j) + \sum_{j=1}^{s} \beta_j\, \sigma(Y_j)\, J_1. \tag{7.5.15}$$

Here $J_1 = W_{h(k+1)} - W_{hk}$, which is the first stochastic integral: $J_1 = \int_{hk}^{h(k+1)} \circ\, dW_s$. As noted in [40], methods of the form (7.5.14, 7.5.15) cannot have strong order greater than 1.5. That is, (7.5.10) cannot hold with $\alpha > 1.5$, which was shown by Rümelin [221]. This order is, in fact, obtained by the stochastic version of Heun's method:

$$\begin{aligned}
\widehat{Y} &= \widehat{X}^h_{hk} + h\, f(\widehat{X}^h_{hk}) + \sigma(\widehat{X}^h_{hk})\, J_1, \\
\widehat{X}^h_{h(k+1)} &= \widehat{X}^h_{hk} + \tfrac12 h \left[ f(\widehat{X}^h_{hk}) + f(\widehat{Y}) \right] + \tfrac12 \left[ \sigma(\widehat{X}^h_{hk}) + \sigma(\widehat{Y}) \right] J_1.
\end{aligned}$$

This method can also be represented by an extended Butcher tableau.

The key to higher order than 1.5 is to incorporate nested stochastic integrals of Wiener processes. The Milstein method (7.5.12) achieves strong order one instead of $\tfrac12$ by using the nested Itô integral $\int_{hk}^{h(k+1)} \int_{hk}^{s} dW_r\, dW_s = \tfrac12 \left[ (W_{h(k+1)} - W_{hk})^2 - h \right]$. Burrage and Burrage [40] use nested Stratonovich and Riemann integrals, such as

$$\begin{aligned}
J_{11} &= \int_{hk}^{h(k+1)} \int_{hk}^{s} 1 \circ dW_r \circ dW_s = \tfrac12 \left( W_{h(k+1)} - W_{hk} \right)^2, \\
J_{01} &= \int_{hk}^{h(k+1)} \int_{hk}^{s} 1\, dr \circ dW_s, \\
J_{100} &= \int_{hk}^{h(k+1)} \int_{hk}^{s} \int_{hk}^{r} 1 \circ dW_q\, dr\, ds, \quad \text{etc.}
\end{aligned}$$

The latter nested integrals cannot be written in terms of $W_{hk}$ and $W_{h(k+1)}$. Instead, they depend on the path taken between these values. If the underlying Wiener process can be computed to higher resolution, then these nested integrals can be computed. Alternatively, samples of appropriate probability distributions can be computed for the values of the nested integrals. Then the resulting numerical solution is an accurate approximation to the true solution for some underlying Wiener process with the given values of $W_{hk}$, $k = 0, 1, 2, \ldots$.


Exercises. 


(1) Solve the stochastic differential equation for a pendulum

$$d\theta = \omega\,dt, \qquad d\omega = -\frac{g}{\ell} \sin\theta\,dt + \sigma\,dW_t$$

with $g/\ell = 1$ and $\sigma = 10^{-2}$ using the Euler–Maruyama method with step size $\Delta t = 10^{-2}$ over the time interval $[0, 100]$. Use initial conditions $\theta(0) = \pi/3$ and $\omega(0) = 0$. Compare your results with the deterministic equation ($\sigma = 0$) with the same initial conditions and step size, solved with the Euler method.

(2) For the previous Exercise, write out the Fokker–Planck equation (7.5.6) for the stochastic differential equation. Remember the Fokker–Planck equation is for the probability density function $p(t, \theta, \omega)$.

(3) Repeat Exercise 1 for the van der Pol equation (6.2.14) with the same parameter values except $\Delta t = 10^{-3}$, and with $\mu = 2$. Since the deterministic solution has a strongly stable limit cycle, the solution of the stochastic version with small $\sigma$ should remain close to the deterministic trajectory, although the stochastic solution may be delayed or advanced in time. Check if this is true for your solution to the stochastic equation.

(4) Compare numerical solutions of the stochastic differential equation $dX_t = X_t\,dt + X_t\,dW_t$, $X_0 = 1$ over the time interval $t \in [0, 1]$ using the Euler–Maruyama (7.5.9) and the Milstein (7.5.12) methods with step sizes $\Delta t = 2^{-j}$, $j = 1, 2, \ldots, 10$. Be careful to use the same sample $W_t$ of the standard Wiener process. For the "exact" solution, use the Milstein method with $\Delta t = 2^{-16}$. Plot the maximum error in $X_t$ over $t \in [0, 1]$ against $\Delta t$ for each of the two methods. Use a log-log plot, at least at first, to see the order of convergence.

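(5) In Exercise 7.4.7, a Markov chain model of an infectious disease is given. The state space consists of points $(s, i, r) \in \{0, 1, 2, \ldots, N\}^3$ with $s + i + r = N$ (the total population) and three different kinds of transitions. If we take the expectation of the change in $(s, i, r)$ and ignore the fact that $(s, i, r)$ are integers, we obtain the "mass action" differential equations

$$\begin{aligned}
\frac{ds}{dt} &= -\beta\, s\, i + \gamma\, r, \\
\frac{di}{dt} &= +\beta\, s\, i - \alpha\, i, \\
\frac{dr}{dt} &= -\gamma\, r + \alpha\, i.
\end{aligned}$$

However, this ignores the stochastic aspect of the process. In time $\Delta t$, for small $\Delta t$, the transitions are $(s, i, r) \to (s, i-1, r+1)$ with probability $\alpha\, i\, \Delta t + O(\Delta t)^2$, $(s, i, r) \to (s-1, i+1, r)$ with probability $\beta\, s\, i\, \Delta t + O(\Delta t)^2$, and $(s, i, r) \to (s+1, i, r-1)$ with probability $\gamma\, r\, \Delta t + O(\Delta t)^2$. The expected value of the changes

$$\mathbb{E}\left[ \left( (s(t+\Delta t), i(t+\Delta t), r(t+\Delta t)) - (s(t), i(t), r(t)) \right) / \Delta t \right]$$

goes to the right-hand side of the above differential equations as $\Delta t \to 0$. Show that the variance–covariance matrix of the change

$$\mathrm{Var}\left[ (s(t+\Delta t), i(t+\Delta t), r(t+\Delta t)) - (s(t), i(t), r(t)) \right] \sim \Delta t\, V \quad \text{as } \Delta t \downarrow 0$$

where

$$V(s, i, r) = \alpha\, i \begin{bmatrix} 0 \\ -1 \\ +1 \end{bmatrix} \begin{bmatrix} 0 \\ -1 \\ +1 \end{bmatrix}^T + \beta\, i\, s \begin{bmatrix} -1 \\ +1 \\ 0 \end{bmatrix} \begin{bmatrix} -1 \\ +1 \\ 0 \end{bmatrix}^T + \gamma\, r \begin{bmatrix} +1 \\ 0 \\ -1 \end{bmatrix} \begin{bmatrix} +1 \\ 0 \\ -1 \end{bmatrix}^T.$$

Set $\sigma(t, x)$ to be a matrix where $\sigma(t, x)\, \sigma(t, x)^T = V(x)$ with $x = [s, i, r]^T$. The stochastic differential equation that results is

$$d \begin{bmatrix} S_t \\ I_t \\ R_t \end{bmatrix} = \begin{bmatrix} -\beta\, S_t I_t + \gamma\, R_t \\ +\beta\, S_t I_t - \alpha\, I_t \\ -\gamma\, R_t + \alpha\, I_t \end{bmatrix} dt + \sigma(t, X_t)\,dW_t, \qquad X_t = \begin{bmatrix} S_t \\ I_t \\ R_t \end{bmatrix}.$$

Solve this stochastic differential equation using the Euler–Maruyama method using the parameter and initial values in Exercise 7.4.7. Compare the results to simulating the Markov chain of Exercise 7.4.7 directly. Note, however, that we can only expect statistical similarity over many runs.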

(6) Since chaotic differential equations are persistently unstable, any stochastic perturbation (even if very small) will result in large changes in the solution. Consider, for example, the Lorenz equations modeling weather (6.1.13–6.1.15), but convert them to a stochastic differential equation by adding a term "$\sigma\,dW_t$" with $\sigma = 10^{-4}$. Solve with the same initial conditions as for Figure 6.1.2 for a time interval of length 100. Compare the results with the deterministic solution (for $\sigma = 0$).


Project 


A way of generating text is to use a Markov chain. For text that sounds more like English, given a body of text we want to resemble, split the text into words. Choose a positive integer $r$, and for each $r$-tuple $(w_{k-r+1}, \ldots, w_{k-1}, w_k)$ of words, assign a given word $w$ a transition probability of the number of occurrences of word $w$ immediately after words $w_{k-r+1}, \ldots, w_{k-1}, w_k$ in the body of text, divided by the total number of occurrences of $w_{k-r+1}, \ldots, w_{k-1}, w_k$ in the text. This gives the transition probability $(w_{k-r+1}, \ldots, w_{k-1}, w_k) \to (w_{k-r+2}, \ldots, w_k, w)$ in a Markov chain. If $(w_{k-r+1}, \ldots, w_{k-1}, w_k)$ does not occur in the body of text then assign a transition $(w_{k-r+1}, \ldots, w_{k-1}, w_k) \to (w_{k-r+2}, \ldots, w_k, w)$ where $w$ indicates the end of a sentence. Implement this in a suitable programming language (hash tables are good for storing transitions). Pick a suitable value of $r$ (say, 2 or 3). Use some body of text to generate the transition probabilities (Shakespeare is in the public domain at http://shakespeare.mit.edu/, for example). Run the Markov chain to create "text".


Chapter 8
Optimization


Optimization is the task of making the most of what you have. Mathematically, this is turned into finding either the maximum or minimum of a function $f \colon A \to \mathbb{R}$ where $A \subseteq \mathbb{R}^n$ is the feasible set, the set of allowed choices. Since maximizing $f(x)$ over $x \in A$ is equivalent to minimizing $-f(x)$ over $x \in A$, by convention we usually consider minimizing $f(x)$ over $x \in A$. After reviewing the necessary and sufficient conditions for optimization, we turn to numerical methods. Of particular importance is the distinction between convex and non-convex optimization problems.

This chapter can only scratch the surface of the wide range of optimization meth- 
ods. Nocedal and Wright [190], for example, go into most of the topics covered here 
in much more depth. Kochenderfer and Wheeler [146], on the other hand, delve into 
many more algorithms than here but in less depth. 


8.1 Basics of Optimization 


8.1.1 Existence of Minimizers 


Of course, there might not be a minimum. The function $f(x) = x$ over $A = \mathbb{R}$ does not have a minimum; this function is unbounded below: for any $L \in \mathbb{R}$ there is an $x \in \mathbb{R}$ where $f(x) < L$. Other functions, such as $f(x) = e^x$ over $A = \mathbb{R}$, are bounded below ($f(x) \ge 0$ for all $x \in \mathbb{R}$) but have no minimum. Instead, we can make $f(x)$ approach zero as closely as we please, but there is no $x$ where $f(x) = 0$. So the infimum of $f(x) = e^x$ over $x \in \mathbb{R}$ exists and is zero, but since the infimum cannot be attained, there is no minimum.

A variation of this idea is that if $A = (0, 1)$, the open interval of $x \in \mathbb{R}$ where $0 < x < 1$, the minimum of $f(x) = x$ over $x \in A$ does not exist either. Again, we have a situation where the infimum of $f(x)$ over $x \in A$ exists (it is zero) but cannot be attained as this would require using $x = 0$, which is not in $A = (0, 1)$. Similarly, if we have a function that is not continuous, say $f(x) = x$ for $x > 0$ but $f(x) = 1$ for $x \le 0$, then again we have an infimum of $f(x)$ over $x \in \mathbb{R}$ which is finite (again, it is zero), but we cannot attain this value, because $1 = f(0) > \lim_{x \downarrow 0} f(x) = 0$.

We can show existence of a minimizer if $f \colon A \to \mathbb{R}$ is continuous and $A \subseteq \mathbb{R}^n$ is both closed and bounded. That $A$ is closed means: if $x_k \in A$ for $k = 1, 2, \ldots$ and $x_k \to x$ as $k \to \infty$, then $x \in A$ as well. That $A$ is bounded means: there is a bound $B$ where $\|x\| \le B$ for all $x \in A$. Since $A$ is a subset of $\mathbb{R}^n$, the combination of being closed and bounded implies that $A$ is compact, meaning that for any sequence $x_k \in A$ for $k = 1, 2, \ldots$, there is a subsequence $x_{k_j}$, $j = 1, 2, \ldots$ with $k_{j+1} > k_j$ for all $j$, which converges: $x_{k_j} \to \hat{x}$ as $j \to \infty$ for some $\hat{x} \in A$.¹


Theorem 8.1 (Heine–Borel theorem) If $f \colon A \to \mathbb{R}$ is continuous and $A$ compact, then $f$ has both a minimum and a maximum over $A$.


Proof If $f$ is unbounded below over $A$ then there is a sequence $x_k \in A$ where $f(x_k) \to -\infty$ as $k \to \infty$. By the compactness of $A$, there is a subsequence $x_{k_j}$ with $k_{j+1} > k_j$ where $x_{k_j} \to \hat{x}$ as $j \to \infty$. By continuity of $f$, $f(\hat{x}) = \lim_{j \to \infty} f(x_{k_j}) = -\infty$, which is impossible.

So there must be a lower bound $L$ where $f(x) \ge L$ for all $x \in A$. By the greatest lower bound principle of real numbers, there must be a greatest lower bound $\hat{L}$ where $L \le \hat{L} \le f(x)$ for all $x \in A$. Since for any positive integer $n$, $\hat{L} + 1/n$ is not a lower bound, there must be a point $x_n \in A$ where $\hat{L} \le f(x_n) < \hat{L} + 1/n$. By compactness, there is a subsequence $x_{n_j}$ where $x_{n_j} \to \hat{x} \in A$ as $j \to \infty$. Continuity of $f$ then implies that $f(x_{n_j}) \to f(\hat{x})$ as $j \to \infty$; the Squeeze Theorem then implies that $\hat{L} = f(\hat{x}) \le f(x)$ for any $x \in A$. That is, $\hat{x} \in A$ minimizes $f(x)$ over $x \in A$.

To show that $f$ has a maximum over $A$, apply the above argument to $-f$.


Often we have to deal with the situation where $A = \mathbb{R}^n$, which is unconstrained optimization. In this case, Theorem 8.1 cannot be applied because $A = \mathbb{R}^n$ is not bounded. Instead we need a suitable condition on $f$. We say $f$ is coercive over $A$ if $x_k \in A$ for all $k$ and $\|x_k\| \to \infty$ as $k \to \infty$ implies that $f(x_k) \to \infty$ as $k \to \infty$. If $A = \mathbb{R}^n$ we simply say that $f$ is coercive.


Theorem 8.2 If $f \colon A \to \mathbb{R}$ is continuous and coercive, and $A \subseteq \mathbb{R}^n$ is closed and non-empty, then $f$ has a minimizer over $A$.


Proof Pick a point $x_0 \in A$. Let $A_0 = \{ x \in A \mid f(x) \le f(x_0) \}$. The set $A_0$ is called a level set. We will show that $A_0$ is both closed and bounded in $\mathbb{R}^n$. Suppose that $x_k \to x$ as $k \to \infty$ and $x_k \in A_0$ for every $k$. Then $x_k \in A$ and $f(x_k) \le f(x_0)$ for every $k$. Since $x_k \to x$ as $k \to \infty$ and $A$ is closed, $x \in A$. Since $f$ is continuous, $f(x) = \lim_{k \to \infty} f(x_k) \le f(x_0)$. That is, $x \in A_0$, and so $A_0$ is closed.

To see that $A_0$ is bounded, suppose otherwise. We will show this leads to a contradiction. If $A_0$ is unbounded, then for every $n$ there is an $x_n \in A_0$ where $\|x_n\| \ge n$. Clearly, $\|x_n\| \to \infty$ as $n \to \infty$. As $f$ is coercive, $f(x_n) \to \infty$. Therefore, there is an $N$ where $n \ge N$ implies $f(x_n) > f(x_0)$. This violates the definition of $A_0$, which is the contradiction we seek.

Thus $A_0$ is both closed and bounded in $\mathbb{R}^n$. It is therefore compact, and so there must be a minimizer $\hat{x}$ of $f$ over $A_0$. This minimizer in fact minimizes $f$ over all of $A$: $x_0 \in A_0$ by definition of $A_0$. Since $\hat{x}$ minimizes $f$ over $A_0$, $f(\hat{x}) \le f(x_0)$. If $x \in A$ but $x \notin A_0$, then $f(x) > f(x_0) \ge f(\hat{x})$. Thus for any $x \in A$, whether $x \in A_0$ or not, $f(x) \ge f(\hat{x})$ and $\hat{x}$ minimizes $f$ over $A$.


¹ Technically, what is described here is sequential compactness. The proper definition of compactness is best given in terms of topologies. See, for example, [184, Chap. 3]. A set being sequentially compact is equivalent to being compact in any space with a norm or metric.


Telling if a function is coercive can be easy in some cases, like $f(x, y) = x^2 + 2y^2$. Or it can be more challenging, such as for $f(x, y) = e^x - x + y^4 - x y^2$. We can use comparison principles to tell if a function is coercive: if $g$ is coercive and $f(x) \ge g(x)$ for all $x$, then $f$ is also coercive. If $f(x, y) = g(x) + h(y)$ with $g$ and $h$ both continuous and coercive, then so is $f$. Sums of coercive functions are also coercive. Products of coercive functions are coercive. Products of coercive functions and positive constants are also coercive. Some simple rules like $|ab| \le \tfrac12 (a^2 + b^2)$ can also be really helpful.


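Example 8.3. We can show that $f(x, y) = e^x - x + y^4 - x y^2$ is coercive as follows (start by applying the rule $|ab| \le \tfrac12 (a^2 + b^2)$ to $x y^2$): for $x \ge 0$,

$$\begin{aligned}
f(x, y) &= e^x - x + y^4 - x y^2 \\
&\ge e^x - x + y^4 - \tfrac12 (x^2 + y^4) \\
&= e^x - x - \tfrac12 x^2 + \tfrac12 y^4.
\end{aligned}$$

For $x \le 0$,

$$\begin{aligned}
f(x, y) &= e^x - x + y^4 - x y^2 \\
&\ge e^x - x + y^4 \\
&\ge -x + \tfrac12 y^4.
\end{aligned}$$

Now $h(y) = \tfrac12 y^4$ is clearly coercive in $y$; $g(x) = \max(e^x - x - \tfrac12 x^2, -x)$ is coercive in $x$. Note that $f(x, y) \ge g(x) + h(y)$ for all $(x, y)$ so $f$ is also coercive.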
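The comparison bound in Example 8.3 can be spot-checked on a grid. The sketch below is only a numerical sanity check under assumptions made here (the grid range and spacing are arbitrary choices); it does not replace the argument above.

```python
import math

def f(x, y):
    return math.exp(x) - x + y**4 - x * y**2

def g(x):
    return max(math.exp(x) - x - 0.5 * x**2, -x)

def h(y):
    return 0.5 * y**4

# Grid of points in [-10, 10] x [-10, 10] with spacing 1/4
grid = [i / 4.0 for i in range(-40, 41)]
worst_gap = min(f(x, y) - (g(x) + h(y)) for x in grid for y in grid)
print(worst_gap)   # should be non-negative up to roundoff
```

The gap can be zero, for example at $x = y^2$ with $x > 0$, so the bound is tight on part of the plane.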


8.1.2 Necessary Conditions for Local Minimizers 


What we have been calling the minimizer is also called the global minimizer or absolute minimizer. This distinguishes it from a local minimizer: $\hat{x}$ is a local minimizer of $f$ if there is a $\delta > 0$ where $\|x - \hat{x}\| < \delta$ implies $f(\hat{x}) \le f(x)$. We also say that $f(\hat{x})$ is a local minimum. We say $\hat{x}$ is a strict local minimizer if there is a $\delta > 0$ where $0 < \|x - \hat{x}\| < \delta$ implies $f(\hat{x}) < f(x)$. In this case, we say $f(\hat{x})$ is a strict local minimum.

We use tools from calculus to find local minimizers. But telling if a local mini- 
mizer is also a global minimizer takes extra information. We can look for all local 
minimizers. As long as a global minimizer exists, one of the local minimizers will 
also be a global minimizer. We simply need to look at the function values at each 
local minimizer: the minimum of these local minima is the global minimum, again, 
provided a global minimizer exists. 

The usual rule from calculus is that if $x^*$ is a local minimizer of a function $f \colon \mathbb{R} \to \mathbb{R}$, then $f'(x^*) = 0$. We can extend this to the multivariate case.


Theorem 8.4 (Fermat's principle) If $f \colon \mathbb{R}^n \to \mathbb{R}$ has continuous first derivatives and a local minimizer $x^*$ then $\nabla f(x^*) = 0$.


To be clear, Fermat enunciated his principle for a single variable in 1636 [73], 
before either Newton or Leibniz had developed calculus. 


Proof Let $d \in \mathbb{R}^n$ and take $s \ne 0$. Assume that $\delta > 0$ where $\|x - x^*\| < \delta$ implies that $f(x) \ge f(x^*)$. Then provided $|s|\, \|d\| < \delta$ we have $f(x^* + s d) \ge f(x^*)$. For $s > 0$, subtracting $f(x^*)$ and dividing by $s$ gives

$$\frac{f(x^* + s d) - f(x^*)}{s} \ge 0.$$

Taking the limit as $s \downarrow 0$ gives $\nabla f(x^*)^T d \ge 0$. For $s < 0$, subtracting $f(x^*)$ and dividing by $s$ gives

$$\frac{f(x^* + s d) - f(x^*)}{s} \le 0.$$

Taking the limit as $s \uparrow 0$ gives $\nabla f(x^*)^T d \le 0$. So $0 \le \nabla f(x^*)^T d \le 0$; that is, $\nabla f(x^*)^T d = 0$. Since this is true for any $d$, $\nabla f(x^*) = 0$.


Any point $x$ where $\nabla f(x) = 0$ is called a critical point. If we can find all critical points, we can determine which gives the smallest function value; that critical point is the global minimizer, provided a global minimizer exists. But the condition $\nabla f(x) = 0$ does not imply that $x$ is even a local minimizer. In calculus, we look at second derivatives: $f''(x) \ge 0$ is a necessary condition for a local minimizer. Generalizing to the multivariate situation, we have the following theorem.


Theorem 8.5 Suppose that $f \colon \mathbb{R}^n \to \mathbb{R}$ has continuous second order derivatives. Suppose also that $x^*$ is a local minimizer. Then $\nabla f(x^*) = 0$ and $\mathrm{Hess}\, f(x^*)$ is positive semi-definite. If, however, $\nabla f(\hat{x}) = 0$ and $\mathrm{Hess}\, f(\hat{x})$ is positive definite, then $\hat{x}$ is a strict local minimizer.


Proof Suppose that $f$ has continuous second derivatives and $x^*$ is a local minimizer. Then using Taylor series with second-order remainder we have for any $d$,

$$f(x^* + s d) = f(x^*) + \nabla f(x^*)^T (s d) + \tfrac12 (s d)^T\, \mathrm{Hess}\, f(x^* + s' d)\, (s d)$$

for some $s'$ between zero and $s$. By Theorem 8.4, $\nabla f(x^*) = 0$. For any $s \ne 0$ sufficiently small, $f(x^* + s d) \ge f(x^*)$ as $x^*$ is a local minimizer. Then subtracting $f(x^*)$ and dividing by $s^2$ gives

$$0 \le \frac{f(x^* + s d) - f(x^*)}{s^2} = \tfrac12\, d^T\, \mathrm{Hess}\, f(x^* + s' d)\, d.$$

Taking $s \to 0$ we get

$$0 \le \tfrac12\, d^T\, \mathrm{Hess}\, f(x^*)\, d$$

by continuity of the second derivatives. Since this is true for all $d$, $\mathrm{Hess}\, f(x^*)$ is positive semi-definite.

To show that $\mathrm{Hess}\, f(\hat{x})$ positive definite and $\nabla f(\hat{x}) = 0$ is sufficient to make $\hat{x}$ a strict local minimizer, we use Taylor series with second order remainder:

$$\begin{aligned}
f(x) &= f(\hat{x}) + \nabla f(\hat{x})^T (x - \hat{x}) + \tfrac12 (x - \hat{x})^T\, \mathrm{Hess}\, f(\hat{x} + s_x (x - \hat{x}))\, (x - \hat{x}) \\
&= f(\hat{x}) + \tfrac12 (x - \hat{x})^T\, \mathrm{Hess}\, f(\hat{x} + s_x (x - \hat{x}))\, (x - \hat{x})
\end{aligned}$$

for some $s_x$ between zero and one. We just need to show now that $\mathrm{Hess}\, f(\hat{x} + s_x (x - \hat{x}))$ is positive definite provided $\|x - \hat{x}\|$ is small enough.

We can use the Sylvester criterion (2.1.12): a real symmetric $n \times n$ matrix $A$ is positive definite if and only if the determinants

$$\det \begin{bmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & \ddots & \vdots \\ a_{k1} & \cdots & a_{kk} \end{bmatrix} > 0 \quad \text{for } k = 1, 2, \ldots, n.$$

Since $\mathrm{Hess}\, f(\hat{x})$ is symmetric and positive definite, this condition holds. The determinants in Sylvester's criterion are polynomials in the second derivatives of $f(x)$, and therefore continuous functions of $x$. Thus for some $\delta > 0$, $\|z - \hat{x}\| < \delta$ implies $\mathrm{Hess}\, f(z)$ is positive definite. Provided $\|x - \hat{x}\| < \delta$, $\|s_x (x - \hat{x})\| < \delta$ as $s_x$ is between zero and one, and hence $\mathrm{Hess}\, f(\hat{x} + s_x (x - \hat{x}))$ is positive definite as we wanted. Then

$$f(x) = f(\hat{x}) + \tfrac12 (x - \hat{x})^T\, \mathrm{Hess}\, f(\hat{x} + s_x (x - \hat{x}))\, (x - \hat{x}) > f(\hat{x})$$

provided $0 < \|x - \hat{x}\| < \delta$. Thus $\hat{x}$ is a strict local minimizer.
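The Sylvester criterion used in the proof is easy to apply to small Hessian matrices. The following is a minimal sketch written for this purpose, not a library routine: the helper names are hypothetical, the determinant is computed by plain Gaussian elimination, and the test matrices are illustrative.

```python
def determinant(a):
    """Determinant by Gaussian elimination with partial pivoting (a is a list of rows)."""
    a = [row[:] for row in a]   # work on a copy
    n = len(a)
    det = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(a[i][c]))
        if a[p][c] == 0.0:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            det = -det          # a row swap flips the sign
        det *= a[c][c]
        for i in range(c + 1, n):
            m = a[i][c] / a[c][c]
            for j in range(c, n):
                a[i][j] -= m * a[c][j]
    return det

def is_positive_definite(a):
    """Sylvester criterion for a symmetric matrix: all leading principal minors positive."""
    return all(determinant([row[:k] for row in a[:k]]) > 0.0
               for k in range(1, len(a) + 1))

# Hessian of f(x, y) = x^2 + 2 y^2 (constant), and of the saddle x^2 - y^2
print(is_positive_definite([[2.0, 0.0], [0.0, 4.0]]))    # True
print(is_positive_definite([[2.0, 0.0], [0.0, -2.0]]))   # False
```

For large matrices a Cholesky factorization attempt is the more common test; the determinant form is used here only because it mirrors the criterion in the proof.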


Calculus-based methods using derivatives at a point give us useful information about how a function behaves near a point. However, they cannot tell us how the function behaves far away. We need additional information about the function to tell if a local minimizer is a global minimizer. One class of functions for which this is easy is the class of convex functions. This is the focus of Section 8.2.


8.1.3 Lagrange Multipliers and Equality-Constrained 
Optimization 


In equality constrained optimization, we look to minimize a function $f(x)$ subject to equality constraints $g_j(x) = 0$ for $j = 1, 2, \ldots, m$. That is, the feasible set is

$$\Omega = \{ x \in \mathbb{R}^n \mid g_j(x) = 0 \text{ for } j = 1, 2, \ldots, m \} = \{ x \in \mathbb{R}^n \mid g(x) = 0 \}.$$

We will show how we can use Lagrange multipliers to give necessary conditions for constrained local minimizers. Lagrange multipliers were developed by Lagrange in his Mécanique Analytique (1788-89) [43, 151] in dealing with a mechanical system with constraints.

To be precise, $x^*$ is a constrained local minimizer means that there is a $\delta > 0$ where

$$x^* \in \Omega \quad \text{and} \quad \left[ \|x - x^*\| < \delta \text{ and } x \in \Omega \right] \text{ implies } f(x) \ge f(x^*). \tag{8.1.1}$$

We make an assumption about the functions $g_j(x)$:

$$\text{if } g(x) = 0 \text{ then } \{ \nabla g_j(x) \mid j = 1, 2, \ldots, m \} \text{ is linearly independent.} \tag{8.1.2}$$

The condition (8.1.2) is known as the linearly independent constraint qualification (LICQ). The LICQ condition is sufficient to ensure that the feasible set $\Omega$ is a manifold of dimension $n - m$. In general, the solution set of even a single equation $\{ x \mid g(x) = 0 \}$ can be extremely complex; in fact, every closed subset of $\mathbb{R}^n$ is $\{ x \mid g(x) = 0 \}$ for some $C^\infty$ function $g$ (this is an easy consequence of the results in [260]). The LICQ ensures that the feasible set has a suitable structure.

The Lagrange multiplier theorem we prove uses some results involving orthogonal complements (2.2.1). Recall that the orthogonal complement of a vector space $V \subseteq \mathbb{R}^n$ is

$$V^\perp = \{ u \in \mathbb{R}^n \mid u^T v = 0 \text{ for all } v \in V \}.$$

The orthogonal complement of a vector subspace is another vector subspace, and the dimension of $V^\perp$ is $n - \dim V$. We use this for a useful piece of linear algebra:


Theorem 8.6 If $B$ is an $m \times n$ real matrix, then $\mathrm{range}(B) = \{ Bx \mid x \in \mathbb{R}^n \}$ and $\mathrm{null}(B) = \{ u \in \mathbb{R}^n \mid Bu = 0 \}$ are related through

$$\mathrm{null}(B) = \mathrm{range}(B^T)^\perp \quad \text{and} \tag{8.1.3}$$
$$\mathrm{range}(B) = \mathrm{null}(B^T)^\perp. \tag{8.1.4}$$


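Proof We show (8.1.3) first.

$$\begin{aligned}
\{ x \in \mathbb{R}^n \mid Bx = 0 \} &= \{ x \in \mathbb{R}^n \mid y^T B x = 0 \text{ for all } y \} \\
&= \{ x \in \mathbb{R}^n \mid (B^T y)^T x = 0 \text{ for all } y \} \\
&= \{ x \in \mathbb{R}^n \mid z^T x = 0 \text{ for all } z \in \mathrm{range}(B^T) \} \\
&= \mathrm{range}(B^T)^\perp.
\end{aligned}$$

For (8.1.4), we replace $B$ with $B^T$ so that $\mathrm{null}(B^T) = \mathrm{range}(B^{TT})^\perp = \mathrm{range}(B)^\perp$. If we take the orthogonal complement of this, we get $\mathrm{null}(B^T)^\perp = \mathrm{range}(B)^{\perp\perp}$. To complete the proof we just need to show that $\mathrm{range}(B)^{\perp\perp} = \mathrm{range}(B)$.

We show that $V^{\perp\perp} = V$ for any vector space $V \subseteq \mathbb{R}^n$. First, $V \subseteq V^{\perp\perp}$ as

$$V^{\perp\perp} = \{ y \mid z^T y = 0 \text{ for all } z \in V^\perp \}.$$

If $v \in V$, then $v^T z = 0$ for any $z \in V^\perp$ by definition of $V^\perp$. On the other hand, $\dim V^{\perp\perp} = n - \dim V^\perp = n - (n - \dim V) = \dim V$. For there to be any vector $w \in V^{\perp\perp}$ but $w \notin V$ we would need $\dim V^{\perp\perp} > \dim V$. As this is not the case, $V = V^{\perp\perp}$ for any subspace $V$ of $\mathbb{R}^n$.

Thus, $\mathrm{null}(B^T)^\perp = \mathrm{range}(B)$, which shows (8.1.4).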
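A small concrete instance of Theorem 8.6 may be useful. The matrix below is an illustrative choice made here, not an example from the source.

```latex
% Take m = 1, n = 2 and B = [ 1  2 ].
\mathrm{null}(B) = \{ u \in \mathbb{R}^2 \mid u_1 + 2 u_2 = 0 \}
                 = \mathrm{span}\{ [2, -1]^T \},
\qquad
\mathrm{range}(B^T) = \mathrm{span}\{ [1, 2]^T \}.
% Since [2, -1] [1, 2]^T = 0 and both subspaces are one-dimensional in R^2,
% null(B) = range(B^T)^\perp, which is (8.1.3).
% For (8.1.4): range(B) = R, while null(B^T) = \{ u \in R \mid [1, 2]^T u = 0 \} = \{0\},
% so null(B^T)^\perp = R = range(B).
```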


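Given $\hat{x} \in \Omega$, we can apply the QR factorization (2.2.8) to $\nabla g(\hat{x})^T = QR$ with $Q$ orthogonal $n \times n$ and $R$ upper triangular $n \times m$ with $n \ge m$. Since the vectors $\nabla g_j(\hat{x})$ for $j = 1, 2, \ldots, m$ are linearly independent by the LICQ, the rank of $R$ is $m$. Then we can split $Q = [Q_1 \mid Q_2]$ with $Q_1$ $n \times m$ and $Q_2$ $n \times (n - m)$. Both $Q_1$ and $Q_2$ have orthonormal columns; also $\nabla g(\hat{x})^T = Q_1 R_1$, where $R_1$ is invertible. Then $\mathrm{range}(\nabla g(\hat{x})^T) = \mathrm{range}(Q_1)$, so by Theorem 8.6 $\mathrm{null}(\nabla g(\hat{x})) = \mathrm{range}(Q_1)^\perp = \mathrm{range}(Q_2)$.
Consider the function $(y, z) \mapsto g(\hat{x} + Q_1 y + Q_2 z)$. Now

$$\begin{aligned}
\nabla_y \left[ g(\hat{x} + Q_1 y + Q_2 z) \right] &= \nabla g(\hat{x} + Q_1 y + Q_2 z)\, Q_1 \quad \text{so} \\
\nabla_y \left[ g(\hat{x} + Q_1 y + Q_2 z) \right] \Big|_{(y, z) = (0, 0)} &= \nabla g(\hat{x})\, Q_1 = (Q_1 R_1)^T Q_1 = R_1^T,
\end{aligned}$$

which is invertible. Then by the Implicit Function Theorem of multivariate calculus, there is a smooth implicit function $y = h(z)$, and $\delta > 0$, where $g(\hat{x} + Q_1 h(z) + Q_2 z) = 0$ for all $z$ where $\|z\| < \delta$. Since $\nabla_z \left[ g(\hat{x} + Q_1 y + Q_2 z) \right] = \nabla g(\hat{x} + Q_1 y + Q_2 z)\, Q_2$, at $(y, z) = (0, 0)$ the Jacobian matrix with respect to $z$ is $\nabla g(\hat{x})\, Q_2 = (Q_1 R_1)^T Q_2 = R_1^T Q_1^T Q_2 = 0$. So

$$\begin{aligned}
0 &= \nabla_z \left[ g(\hat{x} + Q_1 h(z) + Q_2 z) \right] \\
&= \nabla g(\hat{x} + Q_1 h(z) + Q_2 z) \left[ Q_1 \nabla h(z) + Q_2 \right];
\end{aligned}$$

and at $(y, z) = (0, 0)$ the Jacobian matrix is

$$0 = \nabla g(\hat{x}) \left[ Q_1 \nabla h(0) + Q_2 \right] = R_1^T\, \nabla h(0).$$

As $R_1$ is invertible, $\nabla h(0) = 0$.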

Now we can get back to optimization! 

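We can write the constrained $x = \hat{x} + Q_1 h(z) + Q_2 z$ satisfying $g(x) = 0$ for any $z$ with $\|z\| < \delta$. So we do not have to consider constraints on $z$ around $z = 0$. Then we can apply the conditions for unconstrained optimization to $z \mapsto f(\hat{x} + Q_1 h(z) + Q_2 z)$. Applying Fermat's principle to this function of $z$, if $\hat{x}$ is a local constrained minimizer,

$$0 = \nabla f(\hat{x})^T \left[ Q_1 \nabla h(0) + Q_2 \right] = \nabla f(\hat{x})^T Q_2.$$

This means that $\nabla f(\hat{x})$ is orthogonal to every column of $Q_2$. That is,

$$\nabla f(\hat{x}) \in \mathrm{range}(Q_2)^\perp = \mathrm{range}(Q_1) = \mathrm{range}(\nabla g(\hat{x})^T).$$

That means $\nabla f(\hat{x}) = \nabla g(\hat{x})^T \lambda$ for some $\lambda \in \mathbb{R}^m$. That is,

$$0 = \nabla f(\hat{x}) - \nabla g(\hat{x})^T \lambda = \nabla f(\hat{x}) - \sum_{j=1}^{m} \lambda_j \nabla g_j(\hat{x}). \tag{8.1.5}$$

The vector $\lambda$ is the vector of Lagrange multipliers for $\hat{x}$.
The Lagrangian function is

$$L(x, \lambda) = f(x) - \sum_{j=1}^{m} \lambda_j\, g_j(x). \tag{8.1.6}$$

The Lagrange multiplier condition (8.1.5) is that $\nabla_x L(x, \lambda) = 0$. The constraint equations can be recovered from the condition that $\partial L / \partial \lambda_i\, (x, \lambda) = 0$ for all $i$.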

Example 8.7 To see how this works, consider the problem: $\min_{(x, y)} x$ subject to $x^2 + y^2 = 1$. Then we take $f(x, y) = x$ and $g(x, y) = x^2 + y^2 - 1$. There is only one constraint, so there is only one Lagrange multiplier $\lambda$. The Lagrangian function is $L(x, y, \lambda) = x - \lambda (x^2 + y^2 - 1)$. The Lagrange multiplier conditions for a constrained local minimizer or maximizer are

$$\begin{aligned}
0 &= 1 - 2 x \lambda && (\partial L / \partial x = 0), \\
0 &= 0 - 2 y \lambda && (\partial L / \partial y = 0), \text{ and} \\
0 &= x^2 + y^2 - 1 && (\partial L / \partial \lambda = 0).
\end{aligned}$$

There are exactly two possible solutions of this system of equations: $(x, y, \lambda) = (\pm 1, 0, \pm \tfrac12)$. The point $(x, y) = (+1, 0)$ is actually a constrained maximizer, while $(x, y) = (-1, 0)$ is a constrained minimizer.
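The two candidate solutions of Example 8.7 can be checked directly. The following is a small sketch assuming nothing beyond the example itself: the function name is hypothetical, and the brute-force search over the circle is only a sanity check.

```python
import math

def lagrange_residual(x, y, lam):
    """Residuals of the three Lagrange conditions for f = x, g = x^2 + y^2 - 1."""
    return (1.0 - 2.0 * x * lam, -2.0 * y * lam, x * x + y * y - 1.0)

candidates = [(1.0, 0.0, 0.5), (-1.0, 0.0, -0.5)]
for x, y, lam in candidates:
    print((x, y, lam), lagrange_residual(x, y, lam))

# Brute-force check on the circle x = cos(theta), y = sin(theta)
thetas = [2.0 * math.pi * k / 3600 for k in range(3600)]
theta_min = min(thetas, key=math.cos)
theta_max = max(thetas, key=math.cos)
print(math.cos(theta_min), math.sin(theta_min))   # close to (-1, 0)
print(math.cos(theta_max), math.sin(theta_max))   # close to (+1, 0)
```

Both candidates satisfy the conditions exactly; the search confirms which one minimizes $x$ on the circle.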


8.1.3.1 Shadow Costs and Lagrange Multipliers 


Many ask what the Lagrange multipliers mean. While it can be tempting to think 
of them as just a mathematical device, they do have an important meaning: they are 
shadow costs. These measure the rate of change of the optimal value with respect to 
changes in the constraint functions. If the constraints represent resource limits, then 
the Lagrange multipliers represent the marginal cost (or value) of each additional 
unit of the resources represented. In mechanical systems, the Lagrange multipliers 
represent generalized forces. 

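To see how this works, we modify the constraints to $g(x) - s\gamma = 0$. The optimal solution for these modified constraints is $\hat{x}(s)$. For the problem with modified constraints,

$$\nabla_x \left[ f(x) - \sum_{j=1}^{m} \lambda_j \left( g_j(x) - s \gamma_j \right) \right] = 0 \quad \text{at } x = \hat{x}(s).$$

Since $\hat{x}(s)$ satisfies $g_j(\hat{x}(s)) - \gamma_j s = 0$ for $j = 1, 2, \ldots, m$,

$$\begin{aligned}
f(\hat{x}(s)) &= f(\hat{x}(s)) - \sum_{j=1}^{m} \lambda_j \left[ g_j(\hat{x}(s)) - s \gamma_j \right] \\
&= f(\hat{x}(s)) - \sum_{j=1}^{m} \lambda_j\, g_j(\hat{x}(s)) + s \sum_{j=1}^{m} \lambda_j \gamma_j \\
&= L(\hat{x}(s), \lambda) + s\, \lambda^T \gamma \quad \text{so} \\
\frac{d}{ds} f(\hat{x}(s)) &= \nabla_x L(\hat{x}(s), \lambda)^T \frac{d\hat{x}}{ds}(s) + \lambda^T \gamma \\
&= \lambda^T \gamma \quad \text{since } \nabla_x L(\hat{x}(s), \lambda) = 0.
\end{aligned}$$

This means that changing the constraints by $s\gamma$ changes the function value at the constrained optimum $f(\hat{x}(s))$ by $\lambda^T \gamma\, s + O(s^2)$ provided all functions are smooth. The value of $\lambda_i$ is the "price" per unit change in constraint $i$, which is why $\lambda_i$ is called a shadow price.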
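The shadow price interpretation can be tested on Example 8.7, where everything is available in closed form. This is a sketch under assumptions made here: the perturbed constraint $x^2 + y^2 - 1 = s\gamma$ with $\gamma = 1$, whose minimum of $x$ is $-\sqrt{1 + s\gamma}$, and the function name are illustrative.

```python
import math

def optimal_value(s, gamma=1.0):
    """Minimum of x subject to x^2 + y^2 - 1 = s * gamma (attained at y = 0)."""
    return -math.sqrt(1.0 + s * gamma)

lam = -0.5      # multiplier at the minimizer (-1, 0) from Example 8.7
gamma = 1.0
slopes = []
for s in (1e-1, 1e-2, 1e-3):
    slope = (optimal_value(s, gamma) - optimal_value(0.0, gamma)) / s
    slopes.append(slope)
    print(f"s = {s:g}   finite-difference slope = {slope:.6f}   lambda * gamma = {lam * gamma}")
```

The finite-difference slopes approach $\lambda \gamma = -\tfrac12$ as $s \to 0$, with a gap of size $O(s)$ as the analysis predicts.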
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8.1.3.2 Second-order Conditions 


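We can follow the theory of unconstrained optimization to obtain second-order
necessary conditions for equality constrained local minimizers. These conditions can
be developed in terms of the Lagrangian function: if $\widehat{x} + w \in \Omega$ so that
$g(\widehat{x} + w) = 0$ then

\[
\begin{aligned}
f(\widehat{x} + w) &= f(\widehat{x} + w) - \sum_{j=1}^{m} \lambda_j\, g_j(\widehat{x} + w) \\
&= L(\widehat{x} + w, \lambda) \\
&= L(\widehat{x}, \lambda) + \nabla_x L(\widehat{x}, \lambda)^T w + \tfrac{1}{2} w^T \operatorname{Hess}_x L(\widehat{x}, \lambda)\, w + O(\|w\|^3) \\
&= L(\widehat{x}, \lambda) + \tfrac{1}{2} w^T \operatorname{Hess}_x L(\widehat{x}, \lambda)\, w + O(\|w\|^3) \\
&= f(\widehat{x}) + \tfrac{1}{2} w^T \operatorname{Hess}_x L(\widehat{x}, \lambda)\, w + O(\|w\|^3).
\end{aligned}
\tag{8.1.7}
\]

Assuming the LICQ, let us set $w = Q_1 h(z) + Q_2 z$, since $g(\widehat{x} + w) = 0$. As
$\nabla h(0) = 0$, and $h$ is continuously differentiable, $\|h(z)\| = O(\|z\|^2)$ as $\|z\| \to 0$.
This gives $\|w\| = O(\|z\|)$. Substituting this into (8.1.7) gives

\[
f(\widehat{x} + Q_1 h(z) + Q_2 z) = f(\widehat{x}) + \tfrac{1}{2} z^T Q_2^T \operatorname{Hess}_x L(\widehat{x}, \lambda)\, Q_2 z + O(\|z\|^3).
\]

Thus if $\widehat{x}$ is a constrained local minimizer, $Q_2^T \operatorname{Hess}_x L(\widehat{x}, \lambda)\, Q_2$ must be positive
semi-definite; further, if $Q_2^T \operatorname{Hess}_x L(\widehat{x}, \lambda)\, Q_2$ is positive definite then $\widehat{x}$ is a strict
constrained local minimizer.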

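The matrix $Q_2^T \operatorname{Hess}_x L(\widehat{x}, \lambda)\, Q_2$ is the reduced Hessian matrix of the Lagrangian.
These conditions are equivalent to

\[
\text{necessary conditions:}\quad \nabla_x L(\widehat{x}, \lambda) = 0 \ \text{ and }\
d^T \operatorname{Hess}_x L(\widehat{x}, \lambda)\, d \ge 0 \ \text{ for all } d \in \operatorname{null}(\nabla g(\widehat{x})^T);
\tag{8.1.8}
\]

\[
\text{sufficient conditions:}\quad \nabla_x L(\widehat{x}, \lambda) = 0 \ \text{ and }\
d^T \operatorname{Hess}_x L(\widehat{x}, \lambda)\, d > 0 \ \text{ for all } 0 \ne d \in \operatorname{null}(\nabla g(\widehat{x})^T).
\tag{8.1.9}
\]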


Example 8.8 If we apply this to $\min_{(x,y)} x$ subject to $x^2 + y^2 = 1$ with the Lagrangian
function $L(x, y, \lambda) = x - \lambda(x^2 + y^2 - 1)$ we can determine which solution of the
Lagrange multiplier conditions is a local constrained minimum. The solutions of the
Lagrange multiplier conditions are $(x, y, \lambda) = (\pm 1, 0, \pm\tfrac{1}{2})$. First,

\[
\operatorname{Hess}_x L(x, \lambda) = \begin{bmatrix} -2\lambda & 0 \\ 0 & -2\lambda \end{bmatrix}.
\]

Also, $\nabla g(x, y) = [2x, 2y]^T$ so
$\operatorname{null}(\nabla g(x, y)^T) = \{\, [d_1, d_2]^T \mid 2x\,d_1 + 2y\,d_2 = 0 \,\}$.
In particular, $\operatorname{null}(\nabla g(x, y)^T) = \operatorname{span}\{ [y, -x]^T \}$. If $d \in \operatorname{null}(\nabla g(x, y)^T)$ then
writing $d = s\,[y, -x]^T$ we have

\[
d^T \operatorname{Hess}_x L(x, \lambda)\, d = s^2 (-2\lambda)(x^2 + y^2) = -2\lambda s^2.
\]

The sign of this quantity is the negative of the sign of $\lambda$: taking
$(x, y, \lambda) = (-1, 0, -\tfrac{1}{2})$ gives $d^T \operatorname{Hess}_x L(x, \lambda)\, d = +s^2 > 0$ for $s \ne 0$, and so
$(x, y) = (-1, 0)$ is a local constrained minimizer. This is actually easy to see
geometrically if you draw the feasible set (a unit circle) and look for the point that
minimizes $x$ on this circle. But we see here how to handle the situation with these
more general tools.
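The reduced Hessian test of Example 8.8 is small enough to evaluate directly. The sketch below, with the hypothetical helper `reduced_hessian`, evaluates $d^T \operatorname{Hess}_x L\, d$ for the null-space direction $d = [y, -x]^T$ at both solutions of the Lagrange multiplier conditions; it is only an illustration of the computation, not a general-purpose routine.

```python
def reduced_hessian(x, y, lam):
    """d^T Hess_x L d for L = x - lam*(x^2 + y^2 - 1), with d = [y, -x]^T."""
    d = (y, -x)  # spans the null space of grad g(x, y)^T = [2x, 2y]
    hess = ((-2.0 * lam, 0.0), (0.0, -2.0 * lam))
    hd = (hess[0][0] * d[0] + hess[0][1] * d[1],
          hess[1][0] * d[0] + hess[1][1] * d[1])
    return d[0] * hd[0] + d[1] * hd[1]

for (x, y, lam) in [(1.0, 0.0, 0.5), (-1.0, 0.0, -0.5)]:
    r = reduced_hessian(x, y, lam)
    kind = "minimizer" if r > 0 else "maximizer" if r < 0 else "undetermined"
    print((x, y), r, kind)
```

A positive value indicates a strict constrained local minimizer; a negative value rules one out (and here indicates a constrained maximizer).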


Exercises. 


(1) Show that if g(x) and h(y) are continuous and coercive (limMy++0 g(x) = 
limy_,+00 h(y) = +00), and if f(x, y) = g(x) +h(y), then f(x, y) is also 
coercive. Use this and uv > —$ (aru? +aq*v’) for any a 4 0, to show that 
fx, y) =x? + xy* — 10xy + 4x + y* — 2y is coercive. 

(2) Show that (x*, y*) = (2, 1) is acritical point of f(x, y) = x? + xy? —10xy+ 
4x + y+ — 2y. Using this, or otherwise, find all critical points of f. Which are 
local minimizer? Which is the global minimizer? 

(3) Show that f(x, y) =x°? +e-* + xy + y’ is coercive. Find all critical points 
of f. [Hint: Reduce V f(x) = 0 to a problem in one variable, and use a one- 
variable equation solver.] Find the Hessian matrices of f at each of the critical 
points. Identify all local minimizers and the global minimizer. What is the global 
minimum value? 

(4) Consider the function f : R*° — R given by f(x) = sar g(x;) where g(z) = 
z* — cos(3z) + sin(6z). Check that g is coercive and has six local minimizers, 
and five local maximizers. Show that f is coercive. How many local minimizers 
does f have? How many critical points does f have? 

(5) Find $\max_x\, xy - |x|^p$ for $p > 1$ in terms of $y$.

(6) Compute the gradient and Hessian matrix of f(x) = (x? x). 

(7) Let $f(x) = x^T A x$ with $A$ symmetric. Consider the problem $\min_x f(x)$ subject
to the condition that $x^T x = 1$. Show that the LICQ (8.1.2) holds for this problem.
From the Lagrangian $L(x, \lambda) = f(x) - \lambda\,(x^T x - 1)$, show that the constrained
minimizer must satisfy $Ax = \lambda x$ and $x^T x = 1$ where $\lambda$ is the Lagrange
multiplier. Thus, $\lambda$ is an eigenvalue of $A$. Which eigenvalue gives the minimizer?

(8) Show that the LICQ (8.1.2) holds for the constraint $x^T x = a$ for $a > 0$, but not
for $a = 0$.

(9) Consider the problem of minimizing $e^x + xy$ subject to $x^2 + y^2 = 1$. Reduce
the Lagrange multiplier conditions and the constraint to a single equation in one
variable. Check that the gradient of the Lagrangian is zero. Check the reduced
Hessian matrix at every point you identify as a potential local minimum or
maximum.
8.2 Convex and Non-convex


8.2.1 Convex Functions 


A function $\varphi\colon \mathbb{R}^n \to \mathbb{R}$ is convex if for any $x, y \in \mathbb{R}^n$ and $0 \le \theta \le 1$,

\[
\varphi(\theta x + (1 - \theta) y) \le \theta\,\varphi(x) + (1 - \theta)\,\varphi(y).
\tag{8.2.1}
\]

Convex functions are very important in optimization theory. Convex functions can
be smooth or not smooth. But they cannot be discontinuous unless we allow "$+\infty$"
as a value they can have. We say that $\varphi$ is strictly convex if for any $x \ne y \in \mathbb{R}^n$ and
$0 < \theta < 1$,

\[
\varphi(\theta x + (1 - \theta) y) < \theta\,\varphi(x) + (1 - \theta)\,\varphi(y).
\tag{8.2.2}
\]

If $f$ is smooth there are equivalent ways of telling if a function is convex.


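Theorem 8.9 If $f\colon \mathbb{R}^n \to \mathbb{R}$ has continuous first derivatives, then $f$ is convex if
and only if

\[
f(y) \ge f(x) + \nabla f(x)^T (y - x) \quad \text{for all } x, y \in \mathbb{R}^n.
\tag{8.2.3}
\]

Proof Suppose that $f$ is convex. Then for any $x, y \in \mathbb{R}^n$ and $0 \le \theta \le 1$,

\[
f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y).
\]

Put $\rho = 1 - \theta$, which is also between zero and one:

\[
f((1 - \rho) x + \rho y) \le (1 - \rho) f(x) + \rho f(y).
\]

Subtracting $f(x)$ from both sides and dividing by $\rho > 0$:

\[
\frac{f(x + \rho(y - x)) - f(x)}{\rho} \le f(y) - f(x).
\]

Taking the limit as $\rho \downarrow 0$ gives

\[
\nabla f(x)^T (y - x) \le f(y) - f(x).
\]

Re-arranging gives (8.2.3).

Now suppose that (8.2.3) holds. Put $z = \theta x + (1 - \theta) y$. Then

\[
\begin{aligned}
f(x) &\ge f(z) + \nabla f(z)^T (x - z), \\
f(y) &\ge f(z) + \nabla f(z)^T (y - z).
\end{aligned}
\]

Note that $x - z = (1 - \theta)(x - y)$ and $y - z = \theta(y - x)$. Then

\[
\begin{aligned}
\theta f(x) + (1 - \theta) f(y)
&\ge \theta f(z) + \theta(1 - \theta)\, \nabla f(z)^T (x - y) \\
&\quad + (1 - \theta) f(z) + (1 - \theta)\theta\, \nabla f(z)^T (y - x) \\
&= f(z) = f(\theta x + (1 - \theta) y).
\end{aligned}
\]

That is, $f$ is convex.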


The condition (8.2.3) is a little easier than the original definition (8.2.1) in that we
do not need to consider the parameter $\theta$ in (8.2.3). We can take this one step further,
by looking at second derivatives.


Theorem 8.10 If $f\colon \mathbb{R}^n \to \mathbb{R}$ has continuous second derivatives, then $f$ is convex
if and only if $\operatorname{Hess} f(x)$ is positive semi-definite for all $x$.

Proof Suppose $f$ is convex and has continuous second derivatives. By (8.2.3), setting
$y = x + sd$, with $s > 0$,

\[
\begin{aligned}
f(x + sd) &\ge f(x) + \nabla f(x)^T sd, \\
f(x) &\ge f(x + sd) + \nabla f(x + sd)^T (-sd).
\end{aligned}
\]

Adding both inequalities and subtracting $f(x + sd) + f(x)$ gives

\[
0 \ge s\, [\nabla f(x) - \nabla f(x + sd)]^T d.
\]

Dividing by $s^2$ gives

\[
0 \ge \frac{[\nabla f(x) - \nabla f(x + sd)]^T d}{s}.
\]

Taking the limit $s \downarrow 0$ we get

\[
0 \ge -d^T \operatorname{Hess} f(x)\, d.
\]

Since this is true for all $d$, we see that $\operatorname{Hess} f(x)$ is positive semi-definite for any
choice of $x$.

For the converse, suppose $\operatorname{Hess} f(z)$ is positive semi-definite for all $z$. Then using
Taylor series with second order remainder, we have

\[
\begin{aligned}
f(y) &= f(x) + \nabla f(x)^T (y - x) + \tfrac{1}{2} (y - x)^T \operatorname{Hess} f(x + s(y - x))\, (y - x)
\quad \text{for some } 0 < s < 1 \\
&\ge f(x) + \nabla f(x)^T (y - x).
\end{aligned}
\]

Since this holds for any choice of $x$ and $y$, (8.2.3) holds, and by Theorem 8.9, $f$ is
convex. $\square$


Theorem 8.10 means that we do not need to look at pairs of points (x, y) if we can 
tell if the Hessian matrix of f is positive semi-definite. 

Beyond this, it should be noted that convex functions have a good "arithmetic":
sums of convex functions are convex; the product of a convex function with a positive
constant is convex; linear and constant functions are convex. However, products of
convex functions are generally not convex. Two other operations on convex functions
give convex functions and are especially worth noting here: the point-wise maximum
of convex functions $f(x) = \max(g(x), h(x))$ is convex, and the composition of
convex functions $f(x) = g(h(x))$ is convex provided $g$ is increasing as well as convex.
The proofs of these results are left to the Exercises.
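These closure properties can be explored numerically by searching for violations of inequality (8.2.1). The sketch below is only a randomized check on a one-dimensional interval, so it can refute convexity but never prove it; the helper name `violates_convexity`, the test functions, and the sampling interval are all assumptions made for this illustration.

```python
import random

def violates_convexity(phi, lo, hi, trials=2000, seed=0):
    """Search for x, y, theta with phi(theta*x + (1-theta)*y) > theta*phi(x) + (1-theta)*phi(y)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y, th = rng.uniform(lo, hi), rng.uniform(lo, hi), rng.random()
        lhs = phi(th * x + (1.0 - th) * y)
        rhs = th * phi(x) + (1.0 - th) * phi(y)
        if lhs > rhs + 1e-12 * (1.0 + abs(rhs)):  # small tolerance for roundoff
            return True
    return False

g = lambda x: x * x            # convex
h = lambda x: (x - 2.0) ** 2   # convex

sum_violates = violates_convexity(lambda x: g(x) + h(x), -3.0, 5.0)
max_violates = violates_convexity(lambda x: max(g(x), h(x)), -3.0, 5.0)
prod_violates = violates_convexity(lambda x: g(x) * h(x), -3.0, 5.0)
print(sum_violates, max_violates, prod_violates)
```

The product $x^2 (x - 2)^2$ vanishes at $x = 0$ and $x = 2$ but equals one at $x = 1$, so it cannot be convex, while the sum and the point-wise maximum pass the sampled test.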

While convex functions are not necessarily smooth, provided the value of a convex 
function is always finite, we can at least prove the existence of directional derivatives. 


Lemma 8.11 Any convex function $\varphi\colon \mathbb{R}^n \to \mathbb{R}$ must have directional derivatives:
for any $x, d \in \mathbb{R}^n$,

\[
\varphi'(x; d) = \lim_{h \downarrow 0} \frac{\varphi(x + hd) - \varphi(x)}{h}
\tag{8.2.4}
\]

exists.

Proof We first show that $h \mapsto (\varphi(x + hd) - \varphi(x))/h$ is a non-decreasing
function for $h > 0$. Suppose that $0 < h' < h$. Let $\theta = h'/h$ so that $0 < \theta < 1$. Then by
convexity,

\[
\begin{aligned}
\varphi(\theta(x + hd) + (1 - \theta) x) &\le \theta\,\varphi(x + hd) + (1 - \theta)\,\varphi(x), \quad \text{so} \\
\varphi(x + \theta h d) &\le \varphi(x) + \theta\,[\varphi(x + hd) - \varphi(x)].
\end{aligned}
\]

Subtracting $\varphi(x)$ from both sides and dividing by $h' = \theta h > 0$ gives

\[
\frac{\varphi(x + \theta h d) - \varphi(x)}{\theta h} \le \frac{\varphi(x + hd) - \varphi(x)}{h},
\]

as desired. Since this difference quotient is non-decreasing, either the limit exists,
or the difference quotient is unbounded below giving a limit of $-\infty$. To show
that the limit of $-\infty$ is impossible, we first note that convexity of $\varphi$ implies that
$\varphi(x) \le \tfrac{1}{2}\varphi(x + hd) + \tfrac{1}{2}\varphi(x + h(-d))$, giving
$-\varphi(x + h(-d)) + \varphi(x) \le \varphi(x + hd) - \varphi(x)$, and therefore

\[
-\lim_{h \downarrow 0} \frac{\varphi(x + h(-d)) - \varphi(x)}{h} \le \lim_{h \downarrow 0} \frac{\varphi(x + hd) - \varphi(x)}{h}.
\]

If the right-hand side has the limit $-\infty$, then $\lim_{h \downarrow 0} (\varphi(x + h(-d)) - \varphi(x))/h = +\infty$,
which is not possible as $h \mapsto (\varphi(x + h(-d)) - \varphi(x))/h$ is also a non-decreasing
function. Thus

\[
\varphi'(x; d) = \lim_{h \downarrow 0} \frac{\varphi(x + hd) - \varphi(x)}{h}
\]

exists and is a finite real number.


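The condition for a point $\widehat{x}$ to be a global minimizer of a convex function can be
determined just from its directional derivatives.

Theorem 8.12 If $\varphi\colon \mathbb{R}^n \to \mathbb{R}$ is convex, then $\widehat{x}$ minimizes $\varphi$ if and only if
$\varphi'(\widehat{x}; d) \ge 0$ for all $d$.

Proof If $\widehat{x}$ minimizes $\varphi$, then for any $d$ and $h > 0$, $(\varphi(\widehat{x} + hd) - \varphi(\widehat{x}))/h \ge 0$.
Taking the limit as $h \downarrow 0$, we see that $\varphi'(\widehat{x}; d) \ge 0$ for any $d$.

Conversely, suppose that $\widehat{x}$ does not minimize $\varphi$; then there must be $y$ where
$\varphi(y) < \varphi(\widehat{x})$. Put $d = y - \widehat{x}$. Since $h \mapsto (\varphi(\widehat{x} + hd) - \varphi(\widehat{x}))/h$ is a non-decreasing
function, if $0 < h < 1$,

\[
\frac{\varphi(\widehat{x} + hd) - \varphi(\widehat{x})}{h} \le \frac{\varphi(\widehat{x} + 1\,d) - \varphi(\widehat{x})}{1} = \varphi(y) - \varphi(\widehat{x}) < 0.
\]

Taking the limit as $h \downarrow 0$ gives $\varphi'(\widehat{x}; d) < 0$ for $d = y - \widehat{x}$.
Thus $\widehat{x}$ minimizes $\varphi$ if and only if $\varphi'(\widehat{x}; d) \ge 0$ for all $d$.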


If $f$ is differentiable, then the directional derivative $f'(x; d) = \nabla f(x)^T d$. The
advantage of working with directional derivatives is that directional derivatives
always exist for convex functions with finite values. In fact, this result gives us a
partial non-smooth version of Theorem 8.9: for convex $f\colon \mathbb{R}^n \to \mathbb{R}$,

\[
f(y) \ge f(x) + f'(x; y - x) \quad \text{for all } x, y.
\]

Here, we give a quick example of how we can use this to determine optimality
for convex but nonsmooth functions. Take $\varphi(x) = |x| + g(x)$ where $g$ is smooth and
convex. Since $|x|$ is a convex function of $x$, this $\varphi$ is convex. The directional derivative
is $\varphi'(x; d) = [\operatorname{sign}(x) + g'(x)]\, d$ for $x \ne 0$ and $\varphi'(0; d) = |d| + g'(0)\, d$. Zero is a
global minimizer if $|g'(0)| \le 1$, as then $\varphi'(0; d) \ge 0$ for $d > 0$ and $\varphi'(0; d) \ge 0$ for
$d < 0$, so the conditions of Theorem 8.12 are satisfied.
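The optimality test at zero can be written out in a few lines. This is a minimal sketch assuming the particular smooth convex choice $g(x) = \tfrac{1}{2}(x - a)^2$, so that $g'(0) = -a$; the names `dir_derivative_at_zero`, `zero_is_minimizer` and `phi` are hypothetical.

```python
def dir_derivative_at_zero(g_prime0, d):
    """phi'(0; d) = |d| + g'(0) d for phi(x) = |x| + g(x)."""
    return abs(d) + g_prime0 * d

def zero_is_minimizer(g_prime0):
    # In one dimension phi'(0; d) is positively homogeneous in d,
    # so it suffices to test the two directions d = +1 and d = -1.
    return all(dir_derivative_at_zero(g_prime0, d) >= 0 for d in (1.0, -1.0))

def phi(x, a):
    """phi(x) = |x| + g(x) with the assumed choice g(x) = 0.5*(x - a)^2."""
    return abs(x) + 0.5 * (x - a) ** 2

for a in (0.5, 3.0):
    print(a, zero_is_minimizer(-a))
```

With $a = 0.5$ we have $|g'(0)| \le 1$ and zero is the global minimizer; with $a = 3$ the test fails in the direction $d = +1$, and indeed $\varphi(2) < \varphi(0)$.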


8.2.2 Convex Sets 


A set $C$ in a real vector space is a convex set if

\[
x, y \in C \text{ and } 0 \le \theta \le 1 \quad \text{implies} \quad \theta x + (1 - \theta) y \in C.
\tag{8.2.5}
\]

If $\varphi\colon \mathbb{R}^n \to \mathbb{R}$ is a convex function and $L \in \mathbb{R}$, then the level sets
$\{\, x \in \mathbb{R}^n \mid \varphi(x) \le L \,\}$ are convex. Since norms are convex functions, the unit ball
$\{\, x \in V \mid \|x\|_V < 1 \,\}$ is also a convex set for each vector space $V$ with a norm $\|\cdot\|_V$.
If we use a strict inequality "$<$", then we have an open ball; if we use a non-strict
inequality we get $\{\, x \in V \mid \|x\|_V \le 1 \,\}$, which is a closed ball.

If $C_1$ and $C_2$ are convex sets, then $C_1 \cap C_2$ is either empty or a convex set, but
$C_1 \cup C_2$ usually is not convex. Every real vector space is a convex set. If $\varphi\colon \mathbb{R}^n \to \mathbb{R}$
is convex, then the set

\[
\operatorname{epi} \varphi = \{\, (x, y) \in \mathbb{R}^n \times \mathbb{R} \mid y \ge \varphi(x) \,\},
\tag{8.2.6}
\]

called the epigraph of $\varphi$, is convex.

Given a set $S \subseteq \mathbb{R}^n$, the smallest convex set $C$ containing $S$ is called the convex
hull of $S$ and denoted $\operatorname{co} S$. To see why we can say "the" convex hull, suppose that
$C_1 \ne C_2$ are two different candidates for the convex hull of $S$. Then $C := C_1 \cap C_2$
also contains $S$ and is convex, but $C \subseteq C_1$ and $C \subseteq C_2$, with $C$ strictly smaller than
at least one of them. This contradicts the claim that both $C_1$ and $C_2$ are convex hulls
of $S$. The convex hull of three points not all on a common line is a triangle; the convex
hull of four points not all on a common plane is a tetrahedron.

Many results on convex sets can be obtained through a single theorem: 


Theorem 8.13 (Separating Hyperplane Theorem) If $C$ is a non-empty closed convex
set and $y \notin C$, all in $\mathbb{R}^n$, then there is a vector $n \in \mathbb{R}^n$ and $\beta \in \mathbb{R}$ where

\[
n^T x \le \beta \quad \text{for all } x \in C, \text{ and}
\tag{8.2.7}
\]

\[
n^T y > \beta.
\tag{8.2.8}
\]

This theorem can be generalized to any Banach space, so that it can be applied to
questions about convex sets in infinite dimensional spaces.

Proof Let $\varphi\colon \mathbb{R}^n \to \mathbb{R}$ be the function $\varphi(x) = \tfrac{1}{2} \|x - y\|_2^2$. Note that $\varphi$ is
continuous and coercive, so by Theorem 8.2 there must be a minimizer $x^* \in C$ of $\varphi$ over $C$.
If $x \in C$, then $\theta x + (1 - \theta) x^* = x^* + \theta(x - x^*) \in C$ for any $0 \le \theta \le 1$. For $0 < \theta \le 1$,

\[
\frac{\varphi(x^* + \theta(x - x^*)) - \varphi(x^*)}{\theta} \ge 0.
\]

Taking the limit $\theta \downarrow 0$ we obtain

\[
\nabla \varphi(x^*)^T (x - x^*) \ge 0.
\]

That is,

\[
(x^* - y)^T (x - x^*) \ge 0 \quad \text{for any } x \in C.
\]

Setting $n = y - x^*$ and $\beta = n^T x^*$ we see that for $x \in C$,

\[
n^T x \le n^T x^* = \beta, \quad \text{while} \quad
n^T y = (y - x^*)^T y = (y - x^*)^T (y - x^*) + \beta > \beta,
\]

with a strict inequality since $C \not\ni y \ne x^* \in C$.
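The construction in the proof can be carried out explicitly when $C$ is the closed Euclidean unit ball, because the nearest point is then $x^* = y / \|y\|_2$. The following sketch uses that special case as an assumption; the function name `separate_from_unit_ball` is hypothetical.

```python
import math

def separate_from_unit_ball(y):
    """Return (n, beta) separating y (with ||y||_2 > 1) from the closed unit ball."""
    norm_y = math.sqrt(sum(v * v for v in y))
    if norm_y <= 1.0:
        raise ValueError("y must lie outside the closed unit ball")
    x_star = [v / norm_y for v in y]                  # closest point of the ball to y
    n = [yi - xi for yi, xi in zip(y, x_star)]        # n = y - x*
    beta = sum(ni * xi for ni, xi in zip(n, x_star))  # beta = n^T x*
    return n, beta

y = [3.0, 4.0]
n, beta = separate_from_unit_ball(y)
n_dot_y = sum(ni * yi for ni, yi in zip(n, y))
print(n, beta, n_dot_y)
```

For this choice of $y$ the hyperplane is $n^T x = \beta$ with $n$ parallel to $y$; every point of the ball satisfies $n^T x \le \|n\|_2 = \beta$, while $n^T y$ is strictly larger.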


Many important results about convex functions can be derived from the Separating 
Hyperplane Theorem. This theorem is essential for establishing the Karush—Kuhn— 
Tucker conditions. 


Exercises. 


(1) Show that if $f$ and $g$ are convex functions $\mathbb{R}^n \to \mathbb{R}$ and $\alpha > 0$ is a real number,
then $f + g$ and $\alpha f$ are also convex.

(2) Show that if $f$ and $g$ are convex functions $\mathbb{R}^n \to \mathbb{R}$ then the function
$h(x) = \max(f(x), g(x))$ is also convex.

(3) Show that if $f\colon \mathbb{R} \to \mathbb{R}$ and $g\colon \mathbb{R}^n \to \mathbb{R}$ are convex functions and $f$ is also
a non-decreasing function ($u \ge v$ implies $f(u) \ge f(v)$) then $h(x) = f(g(x))$
is also convex.

(4) Show that the non-decreasing condition on $f$ in the previous Exercise is
necessary, by means of the counter-example $f(u) = \exp(-u)$ and $g(x) = x^2$.

(5) Show that the function $f(x) = \sqrt{1 + x^T x}$ is convex.

(6) Show that the functions $g(u) = -\ln u$ and $h(u) = u \ln u$ are both convex
functions of $u$ for $u > 0$.

(7) Show that linear functions f(x) = ax + b are convex. 

(8) An extreme point of a convex set $C$ is a point $x \in C$ such that there are no points
$y, z \in C$ where $y \ne z$ and $x = \tfrac{1}{2}(y + z)$. Show that any convex function $f$
that has a maximizer over a convex set $C \subseteq \mathbb{R}^n$ has a maximizer that is an
extreme point of $C$. Show that if $f$ is strictly convex then every maximizer over
$C$ is an extreme point of $C$.

(9) Show that the function $f(x) = -\ln(a^T x + b) + \sum_{j=1}^{m} r_j \exp(c_j^T x + d_j) + \tfrac{1}{2} x^T B x$
is convex provided each $r_j > 0$ and $B$ is positive definite.

(10) Show that the set of symmetric positive-definite matrices is a convex set. Also
show that the function $f(A) = -\ln(\det A)$ is convex over the set of symmetric
positive-definite matrices. [Hint: Compute $(d^2/ds^2)\,[-\ln(\det(A + sE))]$
using the fact that $(d/ds) \det(A + sE) = \operatorname{trace}((A + sE)^{-1} E)\, \det(A + sE)$,
that symmetric positive-definite matrices have symmetric positive-definite
square roots, and that the trace of a positive-definite matrix is positive.]
8.3 Gradient Descent and Variants


For $s > 0$ and small, $f(x + sd) \approx f(x) + s\, d^T \nabla f(x)$. If $d^T \nabla f(x) < 0$ we say $d$
is a descent direction of $f$ at $x$. If we minimize $d^T \nabla f(x)$ over $\|d\|_2 = c$ then we
choose $d = -(c / \|\nabla f(x)\|_2)\, \nabla f(x)$. Scaling this vector, we can use $d = -\nabla f(x)$
as the "most efficient" direction in which to reduce the objective function. Stepping in
the negative gradient direction is the basis of the gradient descent method. Gradient
descent is the basis, or a fall-back, for many other algorithms.

Computing gradients is often considered a drawback of gradient descent and other
derivative-based optimization methods, because it is regarded as expensive,
inconvenient, or both. But computing gradients need not be excessively expensive or
inconvenient if automatic or computational differentiation is used (see Section 5.5.2).


8.3.1 Gradient Descent 


The simplest version of gradient descent is to simply step a small, but fixed amount, 
in the negative gradient direction. This is shown in Algorithm 72. 

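To analyze gradient descent algorithms, we make the assumption that $\nabla f$ is a
Lipschitz function with Lipschitz constant $L$:

\[
\|\nabla f(x) - \nabla f(y)\|_2 \le L\, \|x - y\|_2 \quad \text{for all } x, y.
\tag{8.3.1}
\]

For $d = -\nabla f(x)$,

\[
\begin{aligned}
f(x + sd) &= f(x) + s\, d^T \nabla f(x) + \int_0^s d^T [\nabla f(x + td) - \nabla f(x)]\, dt, \quad \text{so} \\
f(x + sd) &\le f(x) - s\, \|\nabla f(x)\|_2^2 + \int_0^s \|d\|_2\, \|\nabla f(x + td) - \nabla f(x)\|_2\, dt
\quad \text{(by the Cauchy–Schwarz inequality)} \\
&\le f(x) - s\, \|\nabla f(x)\|_2^2 + \int_0^s \|\nabla f(x)\|_2\, t L\, \|d\|_2\, dt \\
&= f(x) - s\, \|\nabla f(x)\|_2^2 + \tfrac{1}{2} s^2 L\, \|\nabla f(x)\|_2^2 \\
&= f(x) - s\, \|\nabla f(x)\|_2^2 \left[ 1 - \tfrac{1}{2} s L \right].
\end{aligned}
\tag{8.3.2}
\]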


Algorithm 72 Simple gradient descent

    function simplegraddescent(f, ∇f, x_0, s, ε)
        k ← 0
        while ‖∇f(x_k)‖ > ε
            x_{k+1} ← x_k − s ∇f(x_k)
            k ← k + 1
        end while
        return x_k
    end function

In order to ensure a decrease in the function values, so that $f(x_{k+1}) < f(x_k)$ provided
$\nabla f(x_k) \ne 0$, we should have $0 < s < 2/L$. As we often do not have good estimates
for $L$, the Lipschitz constant for $\nabla f$, we usually make $s > 0$ small. But the smaller
we make $s$, the more iterations are needed to achieve a target reduction in the function
values. Larger $s > 0$ should mean larger steps, and hopefully, fewer steps to come
close to a minimizer.
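A direct transcription of the fixed-step method into Python is short. The sketch below is an illustration rather than a reference implementation: the iteration cap `max_iter` is an added safeguard that Algorithm 72 does not have, and the quadratic test function with $L = 10$ is an assumption chosen so that the condition $0 < s < 2/L$ can be seen at work.

```python
import math

def simple_grad_descent(grad_f, x0, s, eps, max_iter=100_000):
    """Fixed-step gradient descent: x <- x - s*grad_f(x) until ||grad_f(x)||_2 <= eps."""
    x = list(x0)
    for _ in range(max_iter):  # iteration cap: added safeguard, not part of Algorithm 72
        g = grad_f(x)
        if math.sqrt(sum(gi * gi for gi in g)) <= eps:
            break
        x = [xi - s * gi for xi, gi in zip(x, g)]
    return x

# f(x) = 0.5*(x1^2 + 10*x2^2); grad f is Lipschitz with L = 10, so we need 0 < s < 0.2.
f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)
grad_f = lambda x: [x[0], 10.0 * x[1]]

x_min = simple_grad_descent(grad_f, [1.0, 1.0], s=0.15, eps=1e-8)
x_bad = simple_grad_descent(grad_f, [1.0, 1.0], s=0.25, eps=1e-8, max_iter=50)
print(x_min, f(x_bad))
```

With $s = 0.15$ the iterates contract towards the minimizer at the origin; with $s = 0.25 > 2/L$ the second component is multiplied by $-1.5$ at every step and the function values grow.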

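We should also note that we do not guarantee that the $x_k$ converge to a global
minimizer. At best, we can only expect to approach a local minimizer. Even this is
not always true. For $0 < s < 2/L$, all we can guarantee is that either $f(x_k) \to -\infty$
or $\nabla f(x_k) \to 0$ as $k \to \infty$. To see why, we start with

\[
f(x_{k+1}) \le f(x_k) - s\, \|\nabla f(x_k)\|_2^2 \left[ 1 - \tfrac{1}{2} s L \right]; \quad \text{that is,} \quad
f(x_{k+1}) - f(x_k) \le -s\, \|\nabla f(x_k)\|_2^2 \left[ 1 - \tfrac{1}{2} s L \right].
\]

Summing both sides from $k = 0$ to $k = M$ gives

\[
f(x_{M+1}) - f(x_0) \le -s \left[ 1 - \tfrac{1}{2} s L \right] \sum_{k=0}^{M} \|\nabla f(x_k)\|_2^2.
\]

Flipping the signs, and the inequality, gives

\[
f(x_0) - f(x_{M+1}) \ge s \left[ 1 - \tfrac{1}{2} s L \right] \sum_{k=0}^{M} \|\nabla f(x_k)\|_2^2,
\]

so taking $M \to \infty$ gives

\[
f(x_0) - \inf_k f(x_k) \ge s \left[ 1 - \tfrac{1}{2} s L \right] \sum_{k=0}^{\infty} \|\nabla f(x_k)\|_2^2.
\]

Either the left side is infinite ($f(x_k) \to -\infty$) or $\sum_{k=0}^{\infty} \|\nabla f(x_k)\|_2^2$ is finite. Thus if
$f$ is bounded below, then $\nabla f(x_k) \to 0$ as $k \to \infty$.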

To see how quickly the method converges, it helps to see how it works for a simple
problem. Consider, for example, the quadratic function $f(x) = \tfrac{1}{2} x^T A x - b^T x + c$
with $A$ symmetric and positive definite, so that $\nabla f(x) = Ax - b$ and the minimizer is
$A^{-1} b$. A step of the simple gradient descent method gives

\[
\begin{aligned}
x_{k+1} - A^{-1} b &= x_k - s\,(A x_k - b) - A^{-1} b \\
&= (I - sA)(x_k - A^{-1} b).
\end{aligned}
\]

The rate of convergence of this fixed point iteration is the spectral radius (2.4.5)
$\rho(I - sA) = \max(|1 - s\lambda_{\min}|, |1 - s\lambda_{\max}|)$ where $\lambda_{\min}$ and $\lambda_{\max}$ are the minimum
and maximum eigenvalues of $A$ respectively. So, in this case, the optimal $s = s^* =
2/(\lambda_{\min} + \lambda_{\max})$ and the optimal linear convergence rate is given by

\[
\rho(I - s^* A) = \frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}}
= \frac{1 - (\lambda_{\min}/\lambda_{\max})}{1 + (\lambda_{\min}/\lambda_{\max})}
\approx 1 - 2\,\frac{\lambda_{\min}}{\lambda_{\max}} = 1 - \frac{2}{\kappa_2(A)},
\]

where $\kappa_2(A) = \|A^{-1}\|_2\, \|A\|_2 = \lambda_{\max}/\lambda_{\min}$ is the 2-norm condition number of $A$.
Note that $\lambda_{\min} > 0$ as $A$ is assumed positive definite.
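The dependence of the contraction factor on the step length is easy to tabulate. This is a small sketch assuming eigenvalues $\lambda_{\min} = 1$ and $\lambda_{\max} = 10$ (for instance $A = \operatorname{diag}(1, 10)$ with $b = 0$); the helper name `contraction_factor` is hypothetical.

```python
def contraction_factor(s, lam_min, lam_max):
    """Spectral radius of I - s*A for symmetric positive definite A."""
    return max(abs(1.0 - s * lam_min), abs(1.0 - s * lam_max))

lam_min, lam_max = 1.0, 10.0          # assumed extreme eigenvalues of A
s_star = 2.0 / (lam_min + lam_max)    # optimal fixed step length
rho_star = contraction_factor(s_star, lam_min, lam_max)
kappa = lam_max / lam_min             # 2-norm condition number
print(rho_star, (kappa - 1.0) / (kappa + 1.0))

# One step on A = diag(1, 10), b = 0, starting from the error e0 = (1, 1).
e = [1.0, 1.0]
e = [(1.0 - s_star * lam_min) * e[0], (1.0 - s_star * lam_max) * e[1]]
print(e)
```

At $s = s^*$ both error components shrink by the same factor $(\kappa_2 - 1)/(\kappa_2 + 1)$ in magnitude; any other step length makes one of the two factors larger.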
More generally, for convex functions we have an asymptotically slower bound: 


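Theorem 8.14 Suppose $f\colon \mathbb{R}^n \to \mathbb{R}$ is convex and has continuous first derivatives
where $\nabla f$ is Lipschitz with constant $L$, and the minimizer of $f$ is $x^*$. Then provided
$0 < sL \le 1$ in Algorithm 72,

\[
f(x_k) - f(x^*) \le \frac{\|x_0 - x^*\|_2^2}{2 s k}.
\tag{8.3.3}
\]

Proof From (8.3.2), $f(x_{k+1}) \le f(x_k) - \tfrac{1}{2} s\, \|\nabla f(x_k)\|_2^2$ as $0 < sL \le 1$. Since $f$ is
convex,

\[
\begin{aligned}
f(x^*) &\ge f(x_k) + \nabla f(x_k)^T (x^* - x_k), \quad \text{so} \\
f(x_k) &\le f(x^*) + \nabla f(x_k)^T (x_k - x^*).
\end{aligned}
\]

Thus

\[
\begin{aligned}
f(x_{k+1}) &\le f(x_k) - \tfrac{s}{2}\, \|\nabla f(x_k)\|_2^2 \\
&\le f(x^*) + \nabla f(x_k)^T (x_k - x^*) - \tfrac{s}{2}\, \|\nabla f(x_k)\|_2^2.
\end{aligned}
\]

Then

\[
\begin{aligned}
f(x_{k+1}) - f(x^*) &\le \frac{1}{2s} \left[ 2s\, \nabla f(x_k)^T (x_k - x^*) - s^2 \|\nabla f(x_k)\|_2^2 \right] \\
&= \frac{1}{2s} \left[ \|x_k - x^*\|_2^2 - \|x_k - x^*\|_2^2 + 2s\, \nabla f(x_k)^T (x_k - x^*) - s^2 \|\nabla f(x_k)\|_2^2 \right] \\
&= \frac{1}{2s} \left[ \|x_k - x^*\|_2^2 - \|x_k - x^* - s \nabla f(x_k)\|_2^2 \right] \\
&= \frac{1}{2s} \left[ \|x_k - x^*\|_2^2 - \|x_{k+1} - x^*\|_2^2 \right].
\end{aligned}
\]

Summing over $k$ gives

\[
\sum_{k=0}^{p-1} \big( f(x_{k+1}) - f(x^*) \big) \le \frac{1}{2s} \left[ \|x_0 - x^*\|_2^2 - \|x_p - x^*\|_2^2 \right]
\le \frac{1}{2s}\, \|x_0 - x^*\|_2^2.
\]

Since $f(x_{k+1}) \le f(x_k)$ for all $k$,

\[
p\, \big( f(x_p) - f(x^*) \big) \le \frac{1}{2s}\, \|x_0 - x^*\|_2^2.
\]

Dividing by $p$ gives the result.

Algorithm 73 Gradient descent with line search

    function graddescent_ls(f, ∇f, x_0, linesearch, params, ε)
        k ← 0
        while ‖∇f(x_k)‖ > ε
            d_k ← −∇f(x_k)
            s_k ← linesearch(f, ∇f, x_k, d_k, params, s_{k−1})
            x_{k+1} ← x_k + s_k d_k
            k ← k + 1
        end while
        return x_k
    end function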
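The bound (8.3.3) can be observed on a small convex problem that is not strongly convex in its first variable. The sketch below assumes the test function $f(x) = \ln\cosh(x_1) + 0.05\, x_2^2$, whose gradient is Lipschitz with $L = 1$ and whose minimizer is $x^* = 0$ with $f(x^*) = 0$; the function and the name `check_bound` are chosen only for this illustration.

```python
import math

def f(x):
    # Convex; grad f = (tanh(x1), 0.1*x2) is Lipschitz with L = 1; minimizer x* = (0, 0).
    return math.log(math.cosh(x[0])) + 0.05 * x[1] ** 2

def grad_f(x):
    return [math.tanh(x[0]), 0.1 * x[1]]

def check_bound(x0, s, steps):
    """True if f(x_k) - f(x*) <= ||x0 - x*||_2^2 / (2 s k) for k = 1, ..., steps."""
    x = list(x0)
    r0 = sum(v * v for v in x0)  # ||x0 - x*||^2 with x* = 0
    for k in range(1, steps + 1):
        g = grad_f(x)
        x = [xi - s * gi for xi, gi in zip(x, g)]
        if f(x) > r0 / (2.0 * s * k):
            return False
    return True

bound_holds = check_bound([5.0, 5.0], s=1.0, steps=200)
print(bound_holds)
```

Here $sL = 1$, the largest step length allowed by the theorem.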


The value of Theorem 8.14 is that the bounds are not asymptotic and do not depend
on the condition number $\kappa_2(\operatorname{Hess} f(x^*))$. But whatever method of analysis is used,
the step length $s$ must be "small enough" for the method to work. Simply fixing its
value does not guarantee it. We need a more adaptive method. We need line searches.


8.3.2. Line Searches 


Knowing the negative gradient direction does not give any indication of how far in 
that direction the algorithm should step. Line searches are needed in all but the most 
basic gradient descent algorithms. Line search algorithms make gradient descent 
algorithms much more robust. The basic idea of a line search algorithm for gradient 
descent is shown in Algorithm 73. 

The ideal line search computes the minimum of $f(x + sd)$ over $s \ge 0$. We want
to ensure that $s = 0$ is not the minimizer as this would mean that no progress would
be made. Usually, this is prevented by ensuring that $d$ is a descent direction for $f$ at
$x$. Setting $d = -\nabla f(x)$ ensures that $d$ is a descent direction.

Algorithm 74 Armijo/backtracking line search

    function armijo(f, ∇f, s_0, x, d, c_1)
        g ← ∇f(x); s ← s_0
        while f(x + s d) > f(x) + c_1 s g^T d
            s ← s/2
        end while
        return s
    end function

8.3.2.1 Armijo/Backtracking Line Search 


The simplest widely used line search algorithm is the Armijo/backtracking algorithm
shown as Algorithm 74 [8]. For the method to terminate, we need $s_0 > 0$ and
$0 < c_1 < 1$. The result of Algorithm 74 is $s$ where

\[
s > 0 \quad \text{and} \quad f(x + sd) \le f(x) + c_1 s\, d^T \nabla f(x).
\tag{8.3.4}
\]

Condition (8.3.4) is called the sufficient decrease condition.
The Armijo/backtracking algorithm terminates in finite time (assuming no roundoff
error) because an infinite loop would imply that

\[
\lim_{k \to \infty} \frac{f(x + s_0 2^{-k} d) - f(x)}{s_0 2^{-k}} = d^T \nabla f(x) \ge c_1\, d^T \nabla f(x),
\]

and thus $\nabla f(x)^T d \ge 0$, contradicting the assumption that $d$ is a descent direction.
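In Python the backtracking loop looks as follows. This is a minimal sketch under the assumptions of the text ($s_0 > 0$, $0 < c_1 < 1$, $d$ a descent direction); the explicit descent check and the quadratic test function are additions made for this illustration.

```python
def armijo(f, grad_f, s0, x, d, c1):
    """Backtracking line search: halve s until the sufficient decrease condition holds."""
    g = grad_f(x)
    slope = sum(gi * di for gi, di in zip(g, d))  # g^T d
    if slope >= 0:
        raise ValueError("d is not a descent direction")  # added check, not in Algorithm 74
    fx, s = f(x), s0
    while f([xi + s * di for xi, di in zip(x, d)]) > fx + c1 * s * slope:
        s *= 0.5
    return s

f = lambda x: x[0] ** 2 + 4.0 * x[1] ** 2
grad_f = lambda x: [2.0 * x[0], 8.0 * x[1]]

x = [1.0, 1.0]
d = [-g for g in grad_f(x)]  # negative gradient direction
s = armijo(f, grad_f, 1.0, x, d, c1=1e-4)
x_new = [xi + s * di for xi, di in zip(x, d)]
print(s, f(x), f(x_new))
```

Starting from $s_0 = 1$, the trial steps $s = 1$ and $s = 1/2$ increase $f$ and are rejected, and $s = 1/4$ is accepted.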


8.3.2.2 Goldstein Condition-based Line Search 


A related method is the Goldstein line search algorithm [103]. This method is based
on satisfying two conditions, the first of which is the sufficient decrease criterion:

\[
f(x + sd) \le f(x) + c_1 s\, d^T \nabla f(x),
\tag{8.3.5}
\]

\[
f(x + sd) \ge f(x) + (1 - c_1) s\, d^T \nabla f(x).
\tag{8.3.6}
\]

In order to satisfy both conditions, it is necessary that $0 < c_1 < 1/2$. An algorithm
to implement the Goldstein search is shown in Algorithm 75.

The Goldstein line search algorithm terminates provided $f$ is continuously
differentiable, $d$ is a descent direction, and $f$ is bounded below, since

• the loop on lines 6–8 terminates for the same reason that the Armijo/backtracking
  algorithm terminates;

• the loop on lines 9–11 terminates since otherwise $s_{hi} \to \infty$ and
  $f(x + s_{hi} d) < f(x) + (1 - c_1) s_{hi}\, \nabla f(x)^T d \to -\infty$, violating the assumption that $f$ is
  bounded below;

• the loop on lines 12–24 maintains the properties that (8.3.5) holds at $s = s_{lo}$ while
  (8.3.6) holds at $s = s_{hi}$. If the loop on lines 12–24 were infinite, then $|s_{hi} - s_{lo}| \to 0$
  as $k \to \infty$, leading to a point $s = s^*$ being the common limit of $s_{lo}$ and $s_{hi}$. Both
  (8.3.5) and (8.3.6) hold at $s = s^*$. Since $0 < c_1 < 1/2$, both (8.3.5) and (8.3.6)
  hold for $s$ near $s^*$, so that the algorithm terminates once $s_{lo}$ and $s_{hi}$ are sufficiently
  close to $s^*$. Thus the assumption of an infinite loop is false.

Algorithm 75 Goldstein line search

     1  function goldstein(f, ∇f, x, d, s, c_1)
     2      v ← ∇f(x)^T d    // slope of f(x + s d) at s = 0
     3      GC1 ← [f(x + s d) ≤ f(x) + c_1 s v]; GC2 ← [f(x + s d) ≥ f(x) + (1 − c_1) s v]
     4      if GC1 and GC2: return s; end if
     5      s_lo ← s; s_hi ← s
     6      while not GC1
     7          s_lo ← s_lo/2; GC1 ← [f(x + s_lo d) ≤ f(x) + c_1 s_lo v]
     8      end while
     9      while not GC2
    10          s_hi ← 2 s_hi; GC2 ← [f(x + s_hi d) ≥ f(x) + (1 − c_1) s_hi v]
    11      end while
    12      while true
    13          s ← (s_lo + s_hi)/2
    14          GC1 ← [f(x + s d) ≤ f(x) + c_1 s v]; GC2 ← [f(x + s d) ≥ f(x) + (1 − c_1) s v]
    15          if GC1
    16              if GC2
    17                  return s
    18              else
    19                  s_lo ← s
    20              end if
    21          else
    22              s_hi ← s
    23          end if
    24      end while
    25  end function


It should be noted that the choice of $s \leftarrow \tfrac{1}{2}(s_{lo} + s_{hi})$ in line 13 is not the only
choice. The hybrid methods discussed in Section 3.4.3 can be used to identify other
formulations that are faster in general. Unlike the methods in Section 3.4.3, we can
terminate the Goldstein line search algorithm as soon as the Goldstein conditions
hold, rather than attempting to solve a nonlinear equation.
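The bisection-based Goldstein search can be sketched in Python as below. It assumes $0 < c_1 < 1/2$, a descent direction $d$, and $f$ bounded below, and it keeps (8.3.5) true at `s_lo` and (8.3.6) true at `s_hi` throughout the final loop; the one-dimensional test problem is an assumption made for illustration.

```python
def goldstein(f, grad_f, x, d, s, c1):
    """Bisection search for a step s satisfying both Goldstein conditions."""
    v = sum(gi * di for gi, di in zip(grad_f(x), d))   # slope of t -> f(x + t d) at t = 0
    fx = f(x)
    phi = lambda t: f([xi + t * di for xi, di in zip(x, d)])
    gc1 = lambda t: phi(t) <= fx + c1 * t * v          # sufficient decrease (8.3.5)
    gc2 = lambda t: phi(t) >= fx + (1.0 - c1) * t * v  # step not too short (8.3.6)
    if gc1(s) and gc2(s):
        return s
    s_lo = s_hi = s
    while not gc1(s_lo):    # shrink until (8.3.5) holds at s_lo
        s_lo *= 0.5
    while not gc2(s_hi):    # grow until (8.3.6) holds at s_hi
        s_hi *= 2.0
    while True:
        s = 0.5 * (s_lo + s_hi)
        if gc1(s):
            if gc2(s):
                return s
            s_lo = s        # (8.3.5) holds at s, (8.3.6) fails: s is too short
        else:
            s_hi = s        # (8.3.5) fails at s, so (8.3.6) holds there: s is too long

f = lambda x: x[0] ** 2
grad_f = lambda x: [2.0 * x[0]]
s_from_large = goldstein(f, grad_f, [1.0], [-2.0], 4.0, c1=0.25)
s_from_small = goldstein(f, grad_f, [1.0], [-2.0], 0.01, c1=0.25)
print(s_from_large, s_from_small)
```

For this problem the two Goldstein conditions reduce to $0.25 \le s \le 0.75$, and both starting guesses lead to a step in that interval.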


8.3.2.3 Wolfe Condition-based Line Search


The Wolfe conditions [262, 263] impose conditions on $\nabla f(x + sd)$ as well as on the
value $f(x + sd)$. The strong Wolfe conditions are the sufficient decrease criterion
(8.3.4)

\[
f(x + sd) \le f(x) + c_1 s\, d^T \nabla f(x)
\tag{8.3.7}
\]

and the "curvature" condition

\[
|d^T \nabla f(x + sd)| \le c_2\, |d^T \nabla f(x)|.
\tag{8.3.8}
\]

Note that to guarantee that both of the Wolfe conditions can be satisfied, we have
to assume that $f$ is continuously differentiable, $f$ is bounded below, and
$0 < c_1 < c_2 < 1$. There are also the weak Wolfe conditions, where the curvature condition
(8.3.8) is replaced by

\[
d^T \nabla f(x + sd) \ge c_2\, d^T \nabla f(x).
\tag{8.3.9}
\]

Lemma 8.15 (Wolfe conditions) If $f\colon \mathbb{R}^n \to \mathbb{R}$ is bounded below and continuously
differentiable with $d^T \nabla f(x) < 0$, then there is a solution $s > 0$ of (8.3.7, 8.3.8)
provided $0 < c_1 < c_2 < 1$.

Proof Suppose that $f(z) \ge f_{\inf}$ for all $z \in \mathbb{R}^n$. Then there is a finite

\[
s_{hi} = \inf \{\, s > 0 \mid f(x + sd) \ge f(x) + c_1 s\, d^T \nabla f(x) \,\}.
\]

The set defining $s_{hi}$ is non-empty because $f$ is differentiable, bounded below, and
$d^T \nabla f(x) < 0$. Also $s_{hi} > 0$ as for sufficiently small $s > 0$ we have
$f(x + sd) < f(x) + c_1 s\, d^T \nabla f(x)$. Note that $f(x + s_{hi} d) = f(x) + c_1 s_{hi}\, d^T \nabla f(x)$. Note that if
$0 < s < s_{hi}$, the condition $f(x + sd) < f(x) + c_1 s\, d^T \nabla f(x)$ holds.

By the Mean Value Theorem, there must be a point $\widehat{s}$ strictly between zero and
$s_{hi}$ where

\[
0 > c_1 s_{hi}\, d^T \nabla f(x) = f(x + s_{hi} d) - f(x) = s_{hi}\, d^T \nabla f(x + \widehat{s} d).
\]

Dividing by $s_{hi}$ gives $d^T \nabla f(x + \widehat{s} d) = c_1\, d^T \nabla f(x)$ and so

\[
|d^T \nabla f(x + \widehat{s} d)| = c_1\, |d^T \nabla f(x)| < c_2\, |d^T \nabla f(x)|.
\]

Furthermore, there is an interval of values of $s$ where both (8.3.7) and (8.3.8)
hold.

An algorithm to find a solution of the strong Wolfe conditions (8.3.7, 8.3.8) is given
in two parts as Algorithms 76 and 77 [190, pp. 60-61]. Algorithm 77 shows the inner
"zoom" function, while Algorithm 76 shows the outer function that calls zoom().
These algorithms are written in terms of the function $\phi(s) = f(x + sd)$ and
$\phi'(s) = d^T \nabla f(x + sd)$.

There are two competing objectives in selecting the step length $s > 0$: we want the
step to give a reduction in the objective function, which can be achieved by choosing
a small value of $s$. On the other hand, making a "safe" choice of $s$ by making it small
means that more steps will be needed to obtain the same reduction in the objective
function value. The Wolfe conditions (8.3.7, 8.3.8) ensure that the step length
$s > 0$ is both "not too large" and "not too small".

Algorithm 76 Wolfe condition-based line search: outer function

    function wolfe(φ, φ′, s_0, s_max)
        choose s_1 ∈ (0, s_max)
        k ← 1
        while true
            if φ(s_k) > φ(0) + c_1 s_k φ′(0) or [φ(s_k) ≥ φ(s_{k−1}) and k > 1]
                return zoom(φ, φ′, s_{k−1}, s_k)
            else if |φ′(s_k)| ≤ c_2 |φ′(0)|
                return s_k
            else if φ′(s_k) ≥ 0
                return zoom(φ, φ′, s_k, s_{k−1})
            end if
            choose s_{k+1} ∈ (s_k, s_max)
            k ← k + 1
        end while
    end function

Algorithm 77 Wolfe condition-based line search: zoom

    function zoom(φ, φ′, s_lo, s_hi)
        while true
            obtain new estimate s between s_lo and s_hi
            if φ(s) > φ(0) + c_1 s φ′(0) or φ(s) ≥ φ(s_lo)
                s_hi ← s
            else
                if |φ′(s)| ≤ c_2 |φ′(0)|
                    return s
                else if φ′(s)(s_hi − s_lo) ≥ 0
                    s_hi ← s_lo
                end if
                s_lo ← s
            end if
        end while
    end function
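A compact Python version of the two-part strong Wolfe search is sketched below. Several details are assumptions made for this illustration and are not fixed by Algorithms 76 and 77: the trial step is doubled (capped at `s_max`) in the outer loop, bisection supplies the "new estimate" inside zoom, and an iteration cap `max_iter` is added as a safeguard.

```python
def wolfe(phi, dphi, s1, s_max, c1=1e-4, c2=0.9, max_iter=50):
    """Strong Wolfe line search on phi(s) = f(x + s d); assumes dphi(0) < 0."""
    phi0, dphi0 = phi(0.0), dphi(0.0)

    def zoom(s_lo, s_hi):
        s = s_lo
        for _ in range(max_iter):          # iteration cap: added safeguard
            s = 0.5 * (s_lo + s_hi)        # bisection as the new estimate (assumed choice)
            if phi(s) > phi0 + c1 * s * dphi0 or phi(s) >= phi(s_lo):
                s_hi = s
            else:
                if abs(dphi(s)) <= c2 * abs(dphi0):
                    return s
                if dphi(s) * (s_hi - s_lo) >= 0:
                    s_hi = s_lo
                s_lo = s
        return s

    s_prev, s = 0.0, s1
    for k in range(1, max_iter + 1):
        if phi(s) > phi0 + c1 * s * dphi0 or (k > 1 and phi(s) >= phi(s_prev)):
            return zoom(s_prev, s)
        if abs(dphi(s)) <= c2 * abs(dphi0):
            return s
        if dphi(s) >= 0:
            return zoom(s, s_prev)
        s_prev, s = s, min(2.0 * s, s_max)  # doubling rule (assumed choice)
    return s

phi = lambda s: (s - 1.0) ** 2
dphi = lambda s: 2.0 * (s - 1.0)
s_a = wolfe(phi, dphi, 0.1, 10.0, c1=1e-4, c2=0.1)
s_b = wolfe(phi, dphi, 5.0, 10.0, c1=1e-4, c2=0.1)
print(s_a, s_b)
```

With the tight choice $c_2 = 0.1$, the curvature condition for this $\phi$ requires $0.9 \le s \le 1.1$, close to the exact minimizer $s = 1$.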


8.3.2.4 Choice of Line Search Parameters 


The line search parameters are $c_1$ for the Armijo/backtracking method and the
Goldstein condition-based search, and both $c_1$ and $c_2$ for the Wolfe condition-based line
search. There are some essential conditions that must be satisfied by these parameters:

• Armijo/backtracking: $0 < c_1 < 1$.
• Goldstein: $0 < c_1 < 1/2$.
• Wolfe: $0 < c_1 < c_2 < 1$.

The smaller $c_2$ is, the tighter the line search is. The exact or ideal line search method
can be approximated by a Wolfe condition-based line search with $c_2$ small.

However, looser line searches are preferred in practice. Looser line search conditions
mean that fewer function evaluations are needed between updates of the line
search direction $d_k$. For Newton and quasi-Newton methods (see Sections 8.4
and 8.5), the initial choice of step length is known ($s_0 = 1$), and this choice will
often work. In this case, looser line search criteria are definitely preferred.


8.3.3 Convergence 


Gradient descent algorithms cannot be guaranteed to give convergence to a global
minimizer. Generally, we might expect that gradient descent algorithms would
converge to a local minimizer. In fact, the best that can be guaranteed for gradient descent
type algorithms is that they converge to a stationary point: $\nabla f(x) = 0$. Consider,
for example, the function $f(x, y) = x^2 - y^2$. This function is unbounded below, but
if we start from $(x_0, 0)$, $x_0 \ne 0$, then any gradient descent algorithm or even Newton
method would converge to $(0, 0)$. Any perturbation $(x_0, y_0)$ with $y_0 \ne 0$ would
result in the iterates $(x_k, y_k)$ where $y_k \to \pm\infty$ as $k \to \infty$. So we cannot guarantee
convergence to a local minimizer, just to a stationary point.
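The saddle point behaviour is easy to reproduce. The sketch below applies fixed-step gradient descent to $f(x, y) = x^2 - y^2$; the step length $s = 0.1$, the number of steps, and the size of the perturbation are arbitrary choices for illustration.

```python
def grad_descent_saddle(x0, y0, s=0.1, steps=200):
    """Fixed-step gradient descent on f(x, y) = x^2 - y^2, whose gradient is (2x, -2y)."""
    x, y = x0, y0
    for _ in range(steps):
        x, y = x - s * 2.0 * x, y + s * 2.0 * y
    return x, y

on_axis = grad_descent_saddle(1.0, 0.0)      # stays on the x-axis, approaches (0, 0)
perturbed = grad_descent_saddle(1.0, 1e-8)   # tiny perturbation in y grows geometrically
print(on_axis, perturbed)
```

Each step multiplies $x$ by $1 - 2s$ and $y$ by $1 + 2s$, so an exactly zero $y$ component stays zero, while any nonzero $y_0$ is amplified without bound.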

We also assume that our objective function $f$ is bounded below, and its gradient
$\nabla f$ is Lipschitz continuous. Zoutendijk's theorem was originally for Wolfe
condition-based line search methods but we prove that the conclusions also hold
for Armijo/backtracking and Goldstein line search methods. The search directions
$d_k$ do not need to be negative gradient vectors, but can be any descent direction
($d_k^T \nabla f(x_k) < 0$).


Theorem 8.16 Suppose that $f\colon \mathbb{R}^n \to \mathbb{R}$ is bounded below with Lipschitz continuous
gradient: $\|\nabla f(x) - \nabla f(y)\|_2 \le L\, \|x - y\|_2$ for all $x, y \in \mathbb{R}^n$. Also suppose
that every search vector $d_k$ is a descent direction (that is, $d_k^T \nabla f(x_k) < 0$). Then
provided either the weak Wolfe conditions (8.3.7, 8.3.9), the Goldstein conditions
(8.3.5, 8.3.6), or the Armijo/backtracking algorithm (Algorithm 74) is used with
$s_0 \|d_k\|_2 \ge c_3 \|\nabla f(x_k)\|_2 > 0$ for all $k$, we have a constant $C$ depending only on
$f(x_0) - \inf_x f(x)$, $L$, $c_1$, $c_2$, and $c_3$, where

\[
\sum_{k=0}^{\infty} \cos^2(\angle d_k, \nabla f(x_k))\, \|\nabla f(x_k)\|_2^2 \le C.
\tag{8.3.10}
\]

The essential point of this theorem is that the angle $\angle d_k, -\nabla f(x_k)$ between the
search direction $d_k$ and the negative gradient direction $-\nabla f(x_k)$ should not become
and stay too close to $\pi/2$.

The central part of the proof is finding a constant $C'$ where

\[
f(x_k) - f(x_{k+1}) \ge C' \cos^2(\angle d_k, \nabla f(x_k))\, \|\nabla f(x_k)\|_2^2.
\]

8.3. Gradient Descent and Variants 563 
Proof Let 0 = Zdx, V f (xx), 80 cos Oj = di V f (xK)/ (della IV f (xe) lo). 

Whichever line search method is used, the sufficient decrease criterion must be 
satisfied. That is, 


SF (Xxe) < fe + syd, Vf (xx) where X44) = X% + Spd. 


Using the Lipschitz continuity of V f, 


Ft, +sdy) = flee) + / ALY f (eg + td) dt 
0 
< fle) +sd™V fe) + i Idelly L lit dgllp dt (by (8.3.2)) 
0 
1 
(8.3.11) = f(xy) +sdi Vf (x) + 5 Ls Id l5 . 


We now need to use the different conditions for the different methods to obtain a
lower bound on $s_k$ of the form constant $\times\, |d_k^T \nabla f(x_k)| / \|d_k\|_2^2$.


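(i) Wolfe conditions: From (8.3.9), $d_k^T \nabla f(x_{k+1}) \ge c_2\, d_k^T \nabla f(x_k)$. Since $\nabla f$ is
Lipschitz, $\|\nabla f(x_{k+1}) - \nabla f(x_k)\|_2 \le L s_k \|d_k\|_2$. Thus

$$d_k^T \nabla f(x_{k+1}) = d_k^T \nabla f(x_k) + d_k^T (\nabla f(x_{k+1}) - \nabla f(x_k))$$
$$\le d_k^T \nabla f(x_k) + \|d_k\|_2\, L\, \|x_{k+1} - x_k\|_2$$
$$= d_k^T \nabla f(x_k) + L s_k \|d_k\|_2^2 .$$

Then $c_2\, d_k^T \nabla f(x_k) \le d_k^T \nabla f(x_{k+1}) \le d_k^T \nabla f(x_k) + L s_k \|d_k\|_2^2$. Since
$d_k^T \nabla f(x_k) < 0$, this gives

$$s_k \ge \frac{(1 - c_2)\, |d_k^T \nabla f(x_k)|}{L \|d_k\|_2^2} .$$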


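(ii) Goldstein conditions: From (8.3.6), $f(x_{k+1}) \ge f(x_k) + (1 - c_1) s_k d_k^T \nabla f(x_k)$.
Then using (8.3.11) with $s = s_k$ we have

$$f(x_k) + s_k d_k^T \nabla f(x_k) + \tfrac{1}{2} L s_k^2 \|d_k\|_2^2 \ge f(x_k) + (1 - c_1) s_k d_k^T \nabla f(x_k), \quad \text{and so}$$

$$\tfrac{1}{2} L s_k^2 \|d_k\|_2^2 \ge -c_1 s_k d_k^T \nabla f(x_k) = c_1 s_k |d_k^T \nabla f(x_k)| .$$

Dividing by $s_k > 0$ and rearranging gives

$$s_k \ge \frac{2 c_1 |d_k^T \nabla f(x_k)|}{L \|d_k\|_2^2} .$$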


(iii) Armijo/backtracking: Case 1: $s_k = s_0$. Then

$$s_k = s_0 \ge c_3 \frac{\|\nabla f(x_k)\|_2}{\|d_k\|_2} \ge c_3 \frac{|d_k^T \nabla f(x_k)|}{\|d_k\|_2^2} .$$

Case 2: $s_k < s_0$. Then the sufficient decrease criterion must be false for $s = 2 s_k$.
That is,

$$f(x_k + 2 s_k d_k) > f(x_k) + c_1 (2 s_k)\, d_k^T \nabla f(x_k) .$$

Using (8.3.11) with $s = 2 s_k$ gives

$$f(x_k) + 2 s_k d_k^T \nabla f(x_k) + \tfrac{1}{2} L (2 s_k)^2 \|d_k\|_2^2 > f(x_k) + c_1\, 2 s_k d_k^T \nabla f(x_k) .$$

Therefore

$$2 L s_k^2 \|d_k\|_2^2 > 2 (c_1 - 1) s_k d_k^T \nabla f(x_k) = 2 (1 - c_1) s_k |d_k^T \nabla f(x_k)| .$$

Dividing by $2 L s_k \|d_k\|_2^2 > 0$ gives

$$s_k > \frac{(1 - c_1)\, |d_k^T \nabla f(x_k)|}{L \|d_k\|_2^2} .$$


Given a lower bound $s_k \ge c_4 |d_k^T \nabla f(x_k)| / \|d_k\|_2^2$, we can substitute for $s_k$ in the
sufficient decrease criterion:

$$f(x_{k+1}) - f(x_k) \le c_1 s_k d_k^T \nabla f(x_k) \le c_1 c_4 \frac{|d_k^T \nabla f(x_k)|}{\|d_k\|_2^2}\, d_k^T \nabla f(x_k)$$
$$= -c_1 c_4 \frac{(d_k^T \nabla f(x_k))^2}{\|d_k\|_2^2 \|\nabla f(x_k)\|_2^2}\, \|\nabla f(x_k)\|_2^2$$
$$= -c_1 c_4 \cos^2\theta_k\, \|\nabla f(x_k)\|_2^2 .$$

Reversing the direction of the inequality,

$$f(x_k) - f(x_{k+1}) \ge c_1 c_4 \cos^2\theta_k\, \|\nabla f(x_k)\|_2^2 .$$

Summing over $k$ we have a telescoping sum on the left:

$$f(x_0) - \lim_{k \to \infty} f(x_k) \ge c_1 c_4 \sum_{k=0}^{\infty} \cos^2\theta_k\, \|\nabla f(x_k)\|_2^2 .$$

The left-hand side is bounded above by $f(x_0) - \inf_x f(x)$, which is finite as $f$ is
bounded below. Thus

$$\sum_{k=0}^{\infty} \cos^2\theta_k\, \|\nabla f(x_k)\|_2^2$$

is finite, as we wanted to show.
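
The one-step inequality at the heart of the proof can be checked numerically. The sketch below is an illustration only, not the book's Algorithm 74: it assumes steepest descent directions $d_k = -\nabla f(x_k)$ (so $\cos^2\theta_k = 1$ and $c_3 = s_0$), a hypothetical test function $f(x) = \frac12(x_1^2 + 10 x_2^2)$ with $L = 10$ and $\inf f = 0$, and a backtracking loop that halves the step. The names `armijo_steepest_descent`, `records`, and `weighted_sum` are made up for this example.

```python
def f(x):
    # Hypothetical test function: f(x) = 0.5*(x1^2 + 10*x2^2), gradient Lipschitz with L = 10
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def grad(x):
    return [x[0], 10.0 * x[1]]

def armijo_steepest_descent(f, grad, x, c1, s0, iters):
    """Steepest descent (d_k = -grad f(x_k)) with Armijo/backtracking by halving.
    Returns the final point and a list of (decrease, ||grad f(x_k)||^2) pairs."""
    records = []
    for _ in range(iters):
        g = grad(x)
        g2 = sum(gi * gi for gi in g)
        if g2 == 0.0:
            break
        fx, s = f(x), s0
        # With d = -g we have d^T grad f = -g2, so the test is f(x - s g) <= f(x) - c1*s*g2
        while f([xi - s * gi for xi, gi in zip(x, g)]) > fx - c1 * s * g2:
            s *= 0.5
        x_new = [xi - s * gi for xi, gi in zip(x, g)]
        records.append((fx - f(x_new), g2))
        x = x_new
    return x, records

c1, s0, L = 0.1, 1.0, 10.0
c3 = s0                        # s0*||d_k|| >= c3*||grad f(x_k)|| holds with equality here
c4 = min(c3, (1.0 - c1) / L)   # smaller of the lower-bound constants from Cases 1 and 2
x0 = [1.0, 1.0]
x_end, records = armijo_steepest_descent(f, grad, x0, c1, s0, 60)
weighted_sum = sum(g2 for _, g2 in records)   # partial sum of cos^2(theta_k)*||grad f(x_k)||^2
print("partial sum:", weighted_sum, " bound C:", f(x0) / (c1 * c4))
```

Every recorded decrease should be at least $c_1 c_4 \|\nabla f(x_k)\|_2^2$, and the partial sums stay below $C = (f(x_0) - \inf f)/(c_1 c_4)$ however many iterations are run.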


Zoutendijk’s theorem (Theorem 8.16) and its variants show that line search methods
converge globally to a stationary point under mild conditions. However, this theorem
does not give much information about how quickly these methods converge. For the
special case of exact line searches ($d_k^T \nabla f(x_k + s_k d_k) = 0$) applied
to a strictly convex quadratic function $f(x) = \frac12 x^T A x - b^T x + c$ with $A$ symmetric
and positive definite, we can show a linear rate of convergence that depends on the
condition number of $A$. Kantorovich’s inequality [137, 243] is used to give the upper
bound.


Theorem 8.17 Suppose $f(x) = \frac12 x^T A x - b^T x + c$ where $A$ is symmetric and
positive definite. Let $f^* = \inf_x f(x)$. Then if steepest descent ($d_k = -\nabla f(x_k)$) is used
with exact line search ($d_k^T \nabla f(x_k + s_k d_k) = 0$) then

(8.3.12)  $$f(x_{k+1}) - f^* \le \left( \frac{\kappa_2(A) - 1}{\kappa_2(A) + 1} \right)^2 [f(x_k) - f^*] \quad \text{for all } k .$$
Proof First note that $\nabla f(x) = Ax - b$. Since $f$ is convex ($\mathrm{Hess} f(x) = A$ is
positive definite everywhere), the sufficient condition for a global minimizer at
$x^*$ is $Ax^* = b$. Also, $f^* = f(x^*) = \frac12 (A^{-1}b)^T A (A^{-1}b) - b^T (A^{-1}b) + c = c - \frac12 b^T A^{-1} b$.
Then using $Ax^* = b$ and $A = A^T$,

$$f(x) - f^* = \tfrac12 x^T A x - (Ax^*)^T x + c - c + \tfrac12 b^T A^{-1} b$$
$$= \tfrac12 x^T A x - (Ax^*)^T x + \tfrac12 (x^*)^T A^T A^{-1} A x^*$$
$$= \tfrac12 (x - x^*)^T A (x - x^*) .$$

Let $g_k = \nabla f(x_k) = A x_k - b$. Note that $g_k = A(x_k - x^*)$.
The exact line search condition implies that $d_k^T \nabla f(x_{k+1}) = 0 = d_k^T (A(x_k + s_k d_k) - b)$,
so $s_k = -d_k^T g_k / (d_k^T A d_k)$. For steepest descent, $d_k = -g_k$. This means
$s_k = g_k^T g_k / (g_k^T A g_k)$. The update is $x_{k+1} = x_k - s_k g_k$. So

$$f(x_{k+1}) - f^* = \tfrac12 (x_{k+1} - x^*)^T A (x_{k+1} - x^*)$$
$$= \tfrac12 (x_k - x^* - s_k g_k)^T A (x_k - x^* - s_k g_k)$$
$$= \tfrac12 (x_k - x^*)^T A (x_k - x^*) - (s_k g_k)^T A (x_k - x^*) + \tfrac12 s_k^2\, g_k^T A g_k$$
$$= (f(x_k) - f^*) - s_k\, g_k^T g_k + \tfrac12 s_k^2\, g_k^T A g_k$$
$$= (f(x_k) - f^*) - \frac{(g_k^T g_k)^2}{g_k^T A g_k} + \frac12 \left( \frac{g_k^T g_k}{g_k^T A g_k} \right)^2 g_k^T A g_k$$
$$= (f(x_k) - f^*) - \frac12 \frac{(g_k^T g_k)^2}{g_k^T A g_k} .$$

But $f(x_k) - f^* = \frac12 g_k^T A^{-1} g_k$. So

$$f(x_{k+1}) - f^* = (f(x_k) - f^*) \left[ 1 - \frac{(g_k^T g_k)^2}{(g_k^T A^{-1} g_k)(g_k^T A g_k)} \right] .$$


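To bound the expression

(8.3.13)  $$\frac{(g^T g)^2}{(g^T A g)(g^T A^{-1} g)}$$

above in terms of $\kappa_2(A)$, we expand $g$ in terms of eigenvectors $v_j$ where $A v_j = \lambda_j v_j$;
since $A$ is symmetric we can choose the $v_j$ to be orthonormal: $g = \sum_j \gamma_j v_j$. Since
$A$ is positive definite, $\lambda_j > 0$ for all $j$. We can order the eigenvalues
$0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. Note that $\kappa_2(A) = \lambda_n / \lambda_1$. By orthonormality of the eigenvectors,
$g^T g = \sum_j \gamma_j^2$ while $g^T A g = \sum_j \lambda_j \gamma_j^2$ and $g^T A^{-1} g = \sum_j \lambda_j^{-1} \gamma_j^2$. Then

$$\frac{(g^T g)^2}{(g^T A g)(g^T A^{-1} g)} = \frac{(\sum_j \gamma_j^2)^2}{(\sum_j \lambda_j \gamma_j^2)(\sum_j \lambda_j^{-1} \gamma_j^2)} .$$

Setting $z_j = \gamma_j^2 / \sum_\ell \gamma_\ell^2$, we can write

(8.3.14)  $$\frac{(g^T g)^2}{(g^T A g)(g^T A^{-1} g)} = \frac{1}{(\sum_j \lambda_j z_j)(\sum_j \lambda_j^{-1} z_j)} .$$

Note that $z_j \ge 0$ for all $j$, and that $\sum_j z_j = 1$. So minimizing (8.3.13) over $g$ is
equivalent to maximizing $(\sum_j \lambda_j z_j)(\sum_j \lambda_j^{-1} z_j)$ over non-negative $z_j$’s that sum to
one.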

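Let $\psi(\lambda) = \lambda^{-1}$ and $\hat\psi(\lambda)$ the linear interpolant of $\psi$ at $\lambda = \lambda_1$ and $\lambda = \lambda_n$. Note
that $\psi$ is positive and convex on $[\lambda_1, \lambda_n] \subset (0, \infty)$. Let $\bar\lambda = \sum_j z_j \lambda_j \in [\lambda_1, \lambda_n]$.
By convexity,

$$\psi(\bar\lambda) = \psi\Big(\sum_j z_j \lambda_j\Big) \le \sum_j z_j \psi(\lambda_j) \le \sum_j z_j \hat\psi(\lambda_j) = \hat\psi\Big(\sum_j z_j \lambda_j\Big) = \hat\psi(\bar\lambda) .$$

The last equality holds since $\hat\psi$ is affine and $\sum_j z_j = 1$. Since $\psi(\lambda) = \lambda^{-1}$ is positive,

$$\Big(\sum_j \lambda_j z_j\Big)\Big(\sum_j \lambda_j^{-1} z_j\Big) = \bar\lambda \sum_j z_j \psi(\lambda_j) \le \bar\lambda\, \hat\psi(\bar\lambda) .$$

The task now is to bound $\lambda\, \hat\psi(\lambda)$ for $\lambda \in [\lambda_1, \lambda_n]$. The function $\lambda \mapsto \lambda\, \hat\psi(\lambda)$ is
concave since $\hat\psi$ is affine and decreasing: $(d/d\lambda)^2 [\lambda\, \hat\psi(\lambda)] = 2 \hat\psi'(\lambda) + \lambda\, \hat\psi''(\lambda) < 0$. So
we seek where $(d/d\lambda)[\lambda\, \hat\psi(\lambda)] = 0$. This maximum occurs at $\lambda = (\lambda_1 + \lambda_n)/2$.
Substituting this into $\lambda\, \hat\psi(\lambda)$ gives $\frac12 (\lambda_1 + \lambda_n) \cdot \frac12 (\lambda_1^{-1} + \lambda_n^{-1}) = \frac14 (2 + (\lambda_1/\lambda_n) + (\lambda_n/\lambda_1))$.
Using this bound in (8.3.14) gives

$$1 - \frac{(g^T g)^2}{(g^T A g)(g^T A^{-1} g)} \le 1 - \frac{1}{\frac12 (\lambda_1 + \lambda_n)\, \frac12 (\lambda_1^{-1} + \lambda_n^{-1})}$$
$$= 1 - \frac{4 \lambda_1 \lambda_n}{(\lambda_1 + \lambda_n)^2} = \frac{(\lambda_1 - \lambda_n)^2}{(\lambda_1 + \lambda_n)^2} = \frac{(\kappa_2(A) - 1)^2}{(\kappa_2(A) + 1)^2} .$$

Finally, we see that

$$f(x_{k+1}) - f^* \le \left( \frac{\kappa_2(A) - 1}{\kappa_2(A) + 1} \right)^2 (f(x_k) - f^*)$$

as we wanted.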


While Theorem 8.17 is for an idealized line search and only gives an upper bound,
the dependence on $\kappa_2(A)$ is clearly observed in practice. The bound (8.3.12) gives a
bound for one step. It is possible to get better performance from gradient descent if,
for example, $-\nabla f(x)$ happens to point nearly directly to the minimizer. This will
happen in the case of quadratic functions if $x - x^*$ is nearly an eigenvector of $A$.
But if $\kappa_2(A)$ is large, small deviations from being an eigenvector will result in wildly
incorrect directions.

To illustrate the behavior of gradient descent with exact line search, we show
a plot of the function values $f(x_k)$ against $k$ in Figure 8.3.1. This is shown for
$f(x) = \frac12 x^T A x$ where $A = \mathrm{diag}(100^{k/n} \mid k = 0, 1, 2, \ldots, n)$ with $n = 20$ and a
randomly chosen $x_0$. Note that $\kappa_2(A) = 100$. While there is an initially rapid reduction
in the function value, this slows to a geometric rate. This geometric rate is
estimated to be $(f(x_{k+1}) - f^*)/(f(x_k) - f^*) \approx \exp(-4.9259 \times 10^{-2})$ by using
the function values at $k = 50$ and $k = 100$, while $((\kappa_2(A) - 1)/(\kappa_2(A) + 1))^2 =
\exp(-4.000 \times 10^{-2})$. The number of iterations needed for $f(x_k) - f^* < \epsilon$ is
asymptotically $\log(1/\epsilon)/\log(1/\rho)$, where $\rho$ is the geometric rate of decrease per
step. Thus the bound is overestimating the number of iterations needed by about
25% (not including the initial rapid reduction).
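
The experiment just described is easy to repeat. The following sketch assumes the same diagonal $A$ with $b = 0$ (so $x^* = 0$ and $f^* = 0$), but the random starting point and the seed are choices made here, so the observed rate need not match the figure quoted above. The function name `steepest_descent_exact` is hypothetical.

```python
import random

def steepest_descent_exact(diag, x, iters):
    """Steepest descent with exact line search for f(x) = 0.5 * x^T A x, A = diag(diag).
    Returns the function values f(x_0), ..., f(x_iters)."""
    def f(x):
        return 0.5 * sum(a * xi * xi for a, xi in zip(diag, x))
    values = [f(x)]
    for _ in range(iters):
        g = [a * xi for a, xi in zip(diag, x)]            # g_k = A x_k  (b = 0, x* = 0)
        gg = sum(gi * gi for gi in g)
        gAg = sum(a * gi * gi for a, gi in zip(diag, g))
        if gAg == 0.0:
            break
        s = gg / gAg                                       # exact step s_k = g^T g / g^T A g
        x = [xi - s * gi for xi, gi in zip(x, g)]
        values.append(f(x))
    return values

n = 20
diag = [100.0 ** (k / n) for k in range(n + 1)]            # eigenvalues 1, ..., 100
kappa = max(diag) / min(diag)
bound = ((kappa - 1.0) / (kappa + 1.0)) ** 2               # one-step bound (8.3.12)

rng = random.Random(0)                                     # seed chosen for reproducibility
x0 = [rng.gauss(0.0, 1.0) for _ in diag]
vals = steepest_descent_exact(diag, x0, 100)
ratios = [vals[k + 1] / vals[k] for k in range(100)]
print("bound from (8.3.12):      ", bound)
print("largest one-step ratio:   ", max(ratios))
print("mean ratio, steps 50-100: ", (vals[100] / vals[50]) ** (1.0 / 50.0))
```

Every one-step ratio should lie at or below the bound; comparing the first few ratios with the later ones shows the initial rapid phase discussed next.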

Why is there such a rapid initial rate of decrease? This is not a peculiarity
of this example. This behavior is often observed. Suppose that the direction of
$x_0 - x^*$ is uniformly distributed. Then $g_0 = \nabla f(x_0) = A(x_0 - x^*)$ will be directed
strongly towards the eigenvector of $A$ with the largest eigenvalue. This makes
$1 - (g^T g)^2 / ((g^T A g)(g^T A^{-1} g))$ close to zero, and we see rapid initial convergence.

Fig. 8.3.1 Function values $f(x_k)$ against $k$ for gradient descent with exact line search

8.3.3.1 Modification for Inexact Line Search Methods


Theorem 8.17 assumes exact line searches. If we use Goldstein or Wolfe-condition-based
line search methods instead, we can still obtain similar bounds. If $x_{k+1}$ is the
result of either line search method and $\hat{x}_{k+1}$ the result of an exact line search from
$x_k$ in direction $d_k = -\nabla f(x_k)$, then provided $f$ is convex and quadratic, there is a
constant $0 < c^* < 1$ (depending only on $c_1$ and $c_2$) where

(8.3.15)  $$\frac{f(x_k) - f(x_{k+1})}{f(x_k) - f(\hat{x}_{k+1})} \ge c^* .$$

For the Goldstein conditions (8.3.5, 8.3.6), we can take $c^* = 4 c_1 (1 - c_1)$. For the
Wolfe conditions (8.3.7, 8.3.9), we can take $c^* = \min(4 c_1 (1 - c_1),\, 1 - c_2^2)$. If the
Armijo/backtracking algorithm is modified so that the returned step length $s$ satisfies the
sufficient decrease criterion (8.3.4) but $2s$ does not, we can prove (8.3.15) with
$c^* = \min(4 c_1 (1 - c_1),\, 1 - c_1^2)$.

Once we have (8.3.15), we can modify the one-step bound for exact line searches
(8.3.12) to

(8.3.16)  $$f(x_{k+1}) - f^* \le \left[ (1 - c^*) + c^* \left( \frac{\kappa_2(A) - 1}{\kappa_2(A) + 1} \right)^2 \right] [f(x_k) - f^*]$$

for Goldstein, Wolfe, or modified Armijo line search methods.
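
The constants $c^*$ can be sanity-checked with a short calculation that is not spelled out in the text and is offered here as an assumption-laden sketch. Along a descent direction of a convex quadratic, write the accepted step as $s = t\,\hat s$ with $\hat s$ the exact line search step; the ratio in (8.3.15) is then $t(2 - t)$. The Goldstein conditions reduce to $2c_1 \le t \le 2(1 - c_1)$, the Wolfe conditions to $1 - c_2 \le t \le 2(1 - c_1)$, and the modified Armijo rule to $1 - c_1 < t \le 2(1 - c_1)$. The helper names below are hypothetical.

```python
def decrease_ratio(t):
    """Ratio (f(x_k) - f(x_{k+1})) / (f(x_k) - f(x_hat_{k+1})) for a convex quadratic
    when the step taken is t times the exact line search step."""
    return t * (2.0 - t)

def worst_ratio(lo, hi, samples=10001):
    """Smallest decrease ratio over a uniform grid of admissible t in [lo, hi]."""
    ts = [lo + (hi - lo) * i / (samples - 1) for i in range(samples)]
    return min(decrease_ratio(t) for t in ts)

def worst_ratio_goldstein(c1):
    return worst_ratio(2.0 * c1, 2.0 * (1.0 - c1))

def worst_ratio_wolfe(c1, c2):
    return worst_ratio(1.0 - c2, 2.0 * (1.0 - c1))

def worst_ratio_modified_armijo(c1):
    # t = 1 - c1 itself is excluded by the rule, so this is an infimum
    return worst_ratio(1.0 - c1, 2.0 * (1.0 - c1))

c1, c2 = 0.1, 0.9
print("Goldstein:      ", worst_ratio_goldstein(c1), "vs", 4 * c1 * (1 - c1))
print("Wolfe:          ", worst_ratio_wolfe(c1, c2), "vs", min(4 * c1 * (1 - c1), 1 - c2 ** 2))
print("modified Armijo:", worst_ratio_modified_armijo(c1), "vs", min(4 * c1 * (1 - c1), 1 - c1 ** 2))
```

Because $t(2 - t)$ is concave, the worst case always sits at an endpoint of the admissible interval, which is where the two terms inside each $\min$ come from.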
Even for non-quadratic functions, these bounds matter: once $x_k$ is close to $x^*$, we
can approximate

$$f(x) \approx f(x^*) + \nabla f(x^*)^T (x - x^*) + \tfrac12 (x - x^*)^T \mathrm{Hess} f(x^*)\, (x - x^*) .$$

So Theorem 8.17 should give a good indication of the behavior of steepest descent
methods with line search close to a local minimizer, provided the Hessian matrix
at the local minimizer is positive definite. In practice, we often see zigzag behavior,
especially when using looser line search methods. For example, using steepest
descent with the Armijo/backtracking line search method ($c_1 = 0.1$, $x_0 = [2, 3]$)
for the Rosenbrock function

(8.3.17)  $$f(x, y) = 100 (y - x^2)^2 + (x - 1)^2,$$

gives the zigzag behavior shown in Figure 8.3.2. The dashed curve in Figure 8.3.2 is
$y = x^2$. After $k = 10^4$ iterations, the distance between $x_k$ and the global optimizer
$x^* = [1, 1]$ is still approximately $1.08 \times 10^{-2}$.

Fig. 8.3.2 Steepest descent with Armijo/backtracking ($c_1 = 0.1$) applied to the Rosenbrock
function (8.3.17)
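
A rough way to reproduce this kind of run is sketched below. It is not the book's Algorithm 74: the initial trial step $s_0 = 1$ and the halving rule are assumptions made here, so the iterates (and the final distance to $x^*$) may differ from those in the figure. The function names are hypothetical.

```python
import math

def rosenbrock(p):
    x, y = p
    return 100.0 * (y - x * x) ** 2 + (x - 1.0) ** 2

def rosenbrock_grad(p):
    x, y = p
    return (-400.0 * x * (y - x * x) + 2.0 * (x - 1.0), 200.0 * (y - x * x))

def armijo_descent(f, grad, p, c1=0.1, s0=1.0, iters=2000):
    """Steepest descent with backtracking; returns the list of iterates."""
    path = [p]
    for _ in range(iters):
        g = grad(p)
        g2 = g[0] ** 2 + g[1] ** 2
        if g2 == 0.0:
            break
        s, fp = s0, f(p)
        # halve s until the sufficient decrease criterion holds
        while f((p[0] - s * g[0], p[1] - s * g[1])) > fp - c1 * s * g2:
            s *= 0.5
        p = (p[0] - s * g[0], p[1] - s * g[1])
        path.append(p)
    return path

path = armijo_descent(rosenbrock, rosenbrock_grad, (2.0, 3.0))
last = path[-1]
print("iterations:", len(path) - 1)
print("f at last iterate:", rosenbrock(last))
print("distance to (1, 1):", math.hypot(last[0] - 1.0, last[1] - 1.0))
```

Plotting the points in `path` against the curve $y = x^2$ should show the iterates bouncing from side to side of the valley floor.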


8.3.4 Stochastic Gradient Method 


The stochastic gradient method is a stochastic version of gradient descent, specifi- 
cally adapted for large data problems. This method is often called stochastic gradient 
descent. However, the method does not guarantee descent of the objective function, 
so here the word “descent” is replaced by “method”. 

A typical large data optimization problem is to find, given a set of data points
$\{ (x_i, y_i) \mid i = 1, 2, \ldots, N \}$, the weight vector $w$ that minimizes

(8.3.18)  $$f(w) = \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, y_i; w) + R(w),$$

where $\ell$ is the loss function, measuring the error in the estimate of $y_i$ given the
“input” data value $x_i$ using the weight vector $w$. The function $R(w)$ is a regularizer
function that is used to prevent $w$ from becoming “too large”.

Algorithm 78 Stochastic gradient method.
  function stochgrad($\varphi$, $x_0$, $(s_0, s_1, s_2, \ldots)$, $n$)
    for $k = 0, 1, 2, \ldots, n-1$
      sample $\xi_k$ from distribution of $\xi$
      $x_{k+1} \leftarrow x_k - s_k \nabla_x \varphi(x_k, \xi_k)$
    end for
    return $x_n$
  end function

A common issue in many big data optimization problems is that the data set is too
large to keep in memory, or to process efficiently as a single unit. Instead, the idea is
to randomly select an index $i$ in $\{1, 2, \ldots, N\}$ and do a partial gradient descent step
with respect to the loss function for the data point $(x_i, y_i)$:

(8.3.19)  $$w_{k+1} \leftarrow w_k - s\, \nabla_w [\ell(x_i, y_i; w) + R(w)] \big|_{w = w_k} .$$

The value of $s > 0$ is usually chosen to be small and fixed. The parameter $s$ can be
referred to as a step length parameter, although machine learning specialists often call
$s$ the learning rate. If $L$ is a common Lipschitz constant for $w \mapsto \nabla_w [\ell(x_i, y_i; w) + R(w)]$,
$i = 1, 2, \ldots, N$, then we choose $0 < s < 1/L$. No line search is performed
as computing $f(w)$ for any $w$ requires accessing the entire data set, and is therefore
very expensive. Performing a line search to minimize $\ell(x_i, y_i; w) + R(w)$ along
$w = w_k + s d_k$ can result in a large step in a bad direction.

It should be noted at the outset that simply using the update (8.3.19) with a
random choice of $i$ for each update will not converge except in special situations.
However, if $s > 0$ is small, then the variance of the $w_k$’s will approach a small but
non-zero value as $k \to \infty$.

Richtárik et al. [188] have an analysis of a family of algorithms for minimizing

$$f(w) = \mathbb{E}_\xi [\varphi(w; \xi)],$$

where $\xi$ is a random variable. In this generalization, each iteration chooses a sample
$\xi$ from its distribution and performs a gradient descent step in $w$:

$$w_{k+1} \leftarrow w_k - s\, \nabla_w [\varphi(w; \xi)] \big|_{w = w_k} .$$

The previous form of stochastic gradient descent can be recovered by making $\xi = i$
the random variable which takes each value in $\{1, 2, \ldots, N\}$ with equal probability.
This generalized algorithm is shown in Algorithm 78.

Algorithm 78 does not include any line search. As noted above, performing a line
search on $\varphi(x + sd, \xi)$ is unlikely to reduce $\mathbb{E}[\varphi(x, \xi)]$, and accurately evaluating
$\mathbb{E}[\varphi(x, \xi)]$ is expensive. This means that the step lengths $s_k$ should be chosen
judiciously. These step lengths should not be constant but should decrease to zero as
$k \to \infty$.
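
As an illustration, here is a minimal Python sketch of Algorithm 78 for a scalar iterate. The test problem is an assumption made for this example: $\varphi(x, \xi) = \frac12 (x - \xi)^2$ with $\xi = \pm 1$ equally likely, so that $f(x) = \frac12 x^2 + \frac12$ has its minimizer at $x = 0$. The step lengths $s_k = 0.5/(k+1)^{3/4}$ are also a choice made here; they decrease to zero while their sum diverges.

```python
import random

def stochgrad(grad_phi, x0, steps, sample, n):
    """Stochastic gradient method: x_{k+1} = x_k - s_k * grad_phi(x_k, xi_k).
    `steps(k)` returns s_k and `sample()` draws xi_k independently of x_k."""
    x = x0
    for k in range(n):
        xi = sample()
        x = x - steps(k) * grad_phi(x, xi)
    return x

rng = random.Random(1)                          # seed chosen for reproducibility
grad_phi = lambda x, xi: x - xi                 # gradient in x of 0.5 * (x - xi)^2
sample = lambda: rng.choice((-1.0, 1.0))
steps = lambda k: 0.5 / (k + 1) ** 0.75

x_final = stochgrad(grad_phi, 3.0, steps, sample, 20000)
print("final iterate:", x_final)                # expected to be close to 0, not exactly 0
```

Individual steps can move away from the minimizer because each one only sees a single sample $\xi_k$; it is the shrinking step length that damps these fluctuations.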


8.3.4.1 Analysis of Stochastic Gradient Method 


Before we continue, we need to note that the iterates $x_k$ are themselves random variables,
along with $\xi_k$ and $\xi$. Also, $\xi_k$ is chosen independently of $x_k$, but $x_{k+1}$ is not
independent of $\xi_k$. Here $\nabla\varphi(x, \xi)$ is the gradient of $\varphi(x, \xi)$ with respect to $x$.

The analysis here is inspired by [188] but is simplified to present the essential
aspects of the analysis.

We make the following assumptions:

(8.3.20)  $\nabla\varphi(x, \xi)$ is Lipschitz continuous in $x$ with constant $L$ for all $\xi$;
(8.3.21)  $\mathrm{Var}[\varphi(x, \xi)]$ and $\mathrm{Var}[\nabla\varphi(x, \xi)]$ are bounded, independently of $x$.


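Since each $x_k$ is a random variable, except possibly $x_0$, we must be somewhat careful
in analyzing the method. Starting from

$$x_{k+1} = x_k - s_k \nabla\varphi(x_k, \xi_k),$$

we use the standard methods of analysis to show that

$$\varphi(x_{k+1}, \xi) \le \varphi(x_k, \xi) + \nabla\varphi(x_k, \xi)^T (x_{k+1} - x_k) + \tfrac12 L \|x_{k+1} - x_k\|_2^2$$
$$= \varphi(x_k, \xi) - s_k \nabla\varphi(x_k, \xi)^T \nabla\varphi(x_k, \xi_k) + \tfrac12 L s_k^2 \|\nabla\varphi(x_k, \xi_k)\|_2^2 .$$

Given the value of $x_k$ we then have the independent random variable $\xi_k$ used
to compute $x_{k+1}$, and another independent random variable $\xi$ used for obtaining
$f(x) = \mathbb{E}[\varphi(x, \xi)]$. Both $\xi_k$ and $\xi$ have the same probability distribution. Taking
expectations conditional on the given value of $x_k$ we get

$$\mathbb{E}[f(x_{k+1}) \mid x_k] = \mathbb{E}[\varphi(x_{k+1}, \xi) \mid x_k]$$
$$\le \mathbb{E}[\varphi(x_k, \xi) \mid x_k] - s_k\, \mathbb{E}[\nabla\varphi(x_k, \xi)^T \nabla\varphi(x_k, \xi_k) \mid x_k] + \tfrac12 L s_k^2\, \mathbb{E}[\|\nabla\varphi(x_k, \xi_k)\|_2^2 \mid x_k] .$$

Note that by independence of $\xi$ and $\xi_k$

$$\mathbb{E}[\nabla\varphi(x_k, \xi)^T \nabla\varphi(x_k, \xi_k) \mid x_k] = \mathbb{E}[\nabla\varphi(x_k, \xi) \mid x_k]^T\, \mathbb{E}[\nabla\varphi(x_k, \xi_k) \mid x_k] = \|\nabla f(x_k)\|_2^2,$$

and that

$$\mathbb{E}[\|\nabla\varphi(x_k, \xi_k)\|_2^2 \mid x_k] = \|\mathbb{E}[\nabla\varphi(x_k, \xi_k) \mid x_k]\|_2^2 + \mathrm{Var}[\nabla\varphi(x_k, \xi_k)]$$
$$= \|\nabla f(x_k)\|_2^2 + \mathrm{Var}[\nabla\varphi(x_k, \xi_k)] .$$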


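Then we see that

$$\mathbb{E}[f(x_{k+1}) \mid x_k] \le f(x_k) - s_k \left(1 - \tfrac12 L s_k\right) \|\nabla f(x_k)\|_2^2 + \tfrac12 L s_k^2\, \mathrm{Var}[\nabla\varphi(x_k, \xi_k) \mid x_k] .$$

If we bound $\mathrm{Var}[\nabla\varphi(x_k, \xi_k) \mid x_k] \le M$ by (8.3.21) and assume $0 < s_k \le 1/L$, then

$$\mathbb{E}[f(x_{k+1}) \mid x_k] \le f(x_k) - \tfrac12 s_k \|\nabla f(x_k)\|_2^2 + \tfrac12 L M s_k^2 .$$

Note that we cannot guarantee that $f(x_{k+1}) < f(x_k)$ no matter how small $s_k > 0$ is.
Summing over the iterations $k$ and taking expectations over all possible sequences
of iterates, we have

$$\mathbb{E}[f(x_n) - f(x_0)] \le -\frac12 \sum_{k=0}^{n-1} s_k\, \mathbb{E}[\|\nabla f(x_k)\|_2^2] + \frac12 L M \sum_{k=0}^{n-1} s_k^2 \quad \text{so that}$$

$$\sum_{k=0}^{n-1} s_k\, \mathbb{E}[\|\nabla f(x_k)\|_2^2] \le 2\, \mathbb{E}[f(x_0) - f(x_n)] + L M \sum_{k=0}^{n-1} s_k^2 .$$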


To say anything useful about the gradients as $n \to \infty$, we need the right-hand side to be
bounded; so we assume that $\sum_{k=0}^{\infty} s_k^2$ is finite, but $\sum_{k=0}^{\infty} s_k$ is infinite. In that case we
can show that

$$\liminf_{k \to \infty} \mathbb{E}[\|\nabla f(x_k)\|_2] = 0 .$$

That is, for any $\epsilon, K > 0$ there is a $k > K$ where $\mathbb{E}[\|\nabla f(x_k)\|_2] < \epsilon$. If we keep
$s_k = s$ constant for all $k$, then dividing the last inequality by $ns$ shows that

$$\liminf_{k \to \infty} \mathbb{E}[\|\nabla f(x_k)\|_2^2] \le L M s = O(s) \quad \text{as } s \downarrow 0 .$$

In this case we do not expect convergence of $\nabla f(x_k)$, but rather $\nabla f(x_k)$ is usually
small while there may be occasional large spikes.
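
The residual gradient level for constant step lengths can be seen in a small simulation. The sketch below reuses a hypothetical scalar test problem, $\varphi(x, \xi) = \frac12 (x - \xi)^2$ with $\xi = \pm 1$ equally likely, for which $\nabla f(x) = x$, $L = 1$, and $\mathrm{Var}[\nabla\varphi(x, \xi)] = 1$, so the quantity $L M s$ is simply $s$. The function name and the burn-in length are choices made here.

```python
import random

def mean_sq_gradient(s, n, burn_in, seed=0):
    """Time average of |grad f(x_k)|^2 = x_k^2 for the constant-step stochastic
    gradient iteration x_{k+1} = x_k - s * (x_k - xi_k), after a burn-in period."""
    rng = random.Random(seed)
    x, total = 0.0, 0.0
    for k in range(n):
        xi = rng.choice((-1.0, 1.0))
        x -= s * (x - xi)
        if k >= burn_in:
            total += x * x
    return total / (n - burn_in)

m_large = mean_sq_gradient(0.1, 200000, 10000)
m_small = mean_sq_gradient(0.01, 200000, 10000)
print("s = 0.1 : average |grad f|^2 =", m_large, "(compare L*M*s = 0.1)")
print("s = 0.01: average |grad f|^2 =", m_small, "(compare L*M*s = 0.01)")
```

The averages do not tend to zero as the number of steps grows; they settle at a level that scales roughly linearly with $s$.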


8.3.4.2 Speed of Convergence: The Linear Least Squares Case 


Consider a simple linear least squares problem

$$\min_a \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i^T a)^2 .$$

Each term is $\varphi(a, i) = (y_i - x_i^T a)^2$. We can use Algorithm 78, sampling $i$ from
$\{1, 2, \ldots, N\}$ uniformly with $\nabla\varphi(a, i) = 2 (x_i^T a - y_i) x_i$. The minimizing $a = a^*$
is given by $\sum_{i=1}^{N} (x_i^T a^* - y_i) x_i = 0$. Letting $\gamma_i = y_i - x_i^T a^*$, the optimality
conditions become $\sum_{i=1}^{N} \gamma_i x_i = 0$. If $i = i(k)$ is the sample chosen from $\{1, 2, \ldots, N\}$ at
step $k$, then

$$a_{k+1} = a_k - 2 s_k (x_i^T a_k - y_i) x_i$$
$$= (I - 2 s_k x_i x_i^T) a_k + 2 s_k y_i x_i .$$

Using $y_i = \gamma_i + x_i^T a^*$,

$$a_{k+1} - a^* = (I - 2 s_k x_i x_i^T)(a_k - a^*) + 2 s_k \gamma_i x_i .$$

Taking expectations,

$$\mathbb{E}[a_{k+1} - a^*] = \mathbb{E}[(I - 2 s_k x_i x_i^T)(a_k - a^*)] + 2 s_k\, \mathbb{E}[\gamma_i x_i] .$$

But $x_i$ is independent of $a_k$ as $i$ is chosen independently of $a_k$, and
$\mathbb{E}[\gamma_i x_i] = N^{-1} \sum_{i=1}^{N} \gamma_i x_i = 0$, so

$$\mathbb{E}[a_{k+1} - a^*] = (I - 2 s_k\, \mathbb{E}[x_i x_i^T])\, \mathbb{E}[a_k - a^*] .$$

Let $B = \mathbb{E}[x_i x_i^T] = N^{-1} \sum_{i=1}^{N} x_i x_i^T$; this is a symmetric positive semi-definite
matrix. We suppose that it is positive definite. If $e_k = a_k - a^*$ then

$$\mathbb{E}[e_{k+1}] = (I - 2 s_k B)\, \mathbb{E}[e_k] .$$

If we keep $s_k = s$ for all $k$ then we still get $\mathbb{E}[e_k] \to 0$ as $k \to \infty$ provided
$s\, \lambda_{\max}(B) < 1$. Furthermore, this convergence is geometric:

$$\|\mathbb{E}[e_{k+1}]\|_2 \le \max(|1 - 2 s_k \lambda_{\min}(B)|,\, |1 - 2 s_k \lambda_{\max}(B)|)\, \|\mathbb{E}[e_k]\|_2 .$$


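The real difficulty is that the variance of $e_k$ does not go to zero geometrically unless
$\gamma_i = 0$ for all $i$ (the “no noise” case). The equations are

$$e_{k+1}^T e_{k+1} = e_k^T (I - 2 s_k x_i x_i^T)^2 e_k + 4 s_k \gamma_i x_i^T (I - 2 s_k x_i x_i^T) e_k + 4 s_k^2 \gamma_i^2 x_i^T x_i .$$

Taking expectations and using independence of $\gamma_i$, $x_i$ with $e_k$ gives

$$\mathbb{E}[\|e_{k+1}\|_2^2] = \mathbb{E}[e_k^T (I - 2 s_k\, \mathbb{E}[x_i x_i^T])^2 e_k]$$
$$\qquad + 4 s_k\, \mathbb{E}[\gamma_i x_i^T (I - 2 s_k x_i x_i^T)]\, \mathbb{E}[e_k]$$
$$\qquad + 4 s_k^2\, \mathbb{E}[\gamma_i^2 x_i^T x_i] .$$

Note that $\mathbb{E}[x_i x_i^T] = B$, and $\mathbb{E}[e_k] \to 0$ geometrically if $s_k = s > 0$ for all $k$. Also,
as $\mathbb{E}[\gamma_i x_i^T] = 0^T$, $\mathbb{E}[\gamma_i x_i^T (I - 2 s_k x_i x_i^T)] = -2 s_k\, \mathbb{E}[\gamma_i x_i^T x_i x_i^T]$. If $\|x_i\|_2 = 1$ for
all $i$, then $\mathbb{E}[\gamma_i x_i^T x_i x_i^T] = \mathbb{E}[\gamma_i x_i^T] = 0^T$. Let $\Gamma = \mathbb{E}[\gamma_i^2 x_i^T x_i]$. Note that $\Gamma \ge 0$
with equality if and only if $\gamma_i = 0$ for all $i$ (“no noise”). Supposing, for example,
that $\|x_i\|_2 = 1$ for all $i$,

$$\mathbb{E}[\|e_{k+1}\|_2^2] = \mathbb{E}[e_k^T (I - 2 s_k B)^2 e_k] - 8 s_k^2\, \mathbb{E}[\gamma_i x_i^T x_i x_i^T]\, \mathbb{E}[e_k] + 4 s_k^2 \Gamma$$
$$= \mathbb{E}[e_k^T (I - 2 s_k B)^2 e_k] + 4 s_k^2 \Gamma$$
$$\le (1 - 2 s_k \lambda_{\min}(B))^2\, \mathbb{E}[\|e_k\|_2^2] + 4 s_k^2 \Gamma,$$

provided $|1 - 2 s_k \lambda_{\max}(B)| \le 1 - 2 s_k \lambda_{\min}(B)$ for all $k$. From these equations it is
clear that for $\mathbb{E}[\|e_k\|_2^2]$ to go to zero as $k \to \infty$ we need $s_k \to 0$ as $k \to \infty$. If
$\sum_{k=0}^{\infty} s_k = +\infty$ and $\sum_{k=0}^{\infty} s_k^2 < +\infty$ then we get $e_k \to 0$ as $k \to \infty$.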
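
The contrast between the noise-free and noisy cases can be tried on a tiny made-up data set. In the sketch below, everything about the data is an assumption for illustration: eight unit vectors $x_i \in \mathbb{R}^2$ at evenly spaced angles (so $B = \frac12 I$), a hypothetical "true" parameter, and a deterministic alternating perturbation standing in for noise. The minimizer $a^*$ of the noisy problem is computed from the $2 \times 2$ normal equations so that errors are measured against the right target.

```python
import math
import random

def solve_normal_equations(xs, ys):
    """Least squares solution for two parameters via the 2x2 normal equations."""
    sxx = sum(x[0] * x[0] for x in xs)
    sxy = sum(x[0] * x[1] for x in xs)
    syy = sum(x[1] * x[1] for x in xs)
    bx = sum(x[0] * y for x, y in zip(xs, ys))
    by = sum(x[1] * y for x, y in zip(xs, ys))
    det = sxx * syy - sxy * sxy
    return [(syy * bx - sxy * by) / det, (sxx * by - sxy * bx) / det]

def sgd_least_squares(xs, ys, step, n, seed=0):
    """a_{k+1} = a_k - 2 s_k (x_i^T a_k - y_i) x_i with i sampled uniformly."""
    rng = random.Random(seed)
    a = [0.0, 0.0]
    for k in range(n):
        i = rng.randrange(len(xs))
        r = xs[i][0] * a[0] + xs[i][1] * a[1] - ys[i]
        s = step(k)
        a = [a[0] - 2.0 * s * r * xs[i][0], a[1] - 2.0 * s * r * xs[i][1]]
    return a

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

N = 8
xs = [(math.cos(math.pi * i / N), math.sin(math.pi * i / N)) for i in range(N)]  # ||x_i|| = 1
a_true = [1.0, -2.0]
clean = [x[0] * a_true[0] + x[1] * a_true[1] for x in xs]      # "no noise" data
noisy = [y + 0.3 * (-1) ** i for i, y in enumerate(clean)]     # perturbed data
a_star = solve_normal_equations(xs, noisy)

const = lambda k: 0.25                                          # s * lambda_max(B) < 1
decaying = lambda k: 0.25 / (k + 1) ** 0.75                     # sum s_k = inf, sum s_k^2 < inf
err_clean = dist(sgd_least_squares(xs, clean, const, 400), a_true)
err_noisy_const = dist(sgd_least_squares(xs, noisy, const, 20000), a_star)
err_noisy_decay = dist(sgd_least_squares(xs, noisy, decaying, 20000), a_star)
print("no noise, constant step:", err_clean)
print("noise, constant step:   ", err_noisy_const)
print("noise, decaying step:   ", err_noisy_decay)
```

With consistent data the constant step drives the error down to rounding level, while with perturbed data a constant step leaves the iterate wandering around $a^*$ and only the decaying steps keep shrinking the error.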


8.3.5 Simulated Annealing 


Simulated annealing [203, 226] was developed as a means of global discrete opti- 
mization. The physical insight is that a physical system that is cooled very rapidly 
often only goes part way towards the minimum energy configuration, and often ends 
up in a local but not global minimum of the total potential energy. On the other 
hand, cooling the same system slowly allows the system to come close to the global
minimum of total potential energy.

Temperature here relates to thermal energy, which is kinetic energy at a micro- 
scopic level. Thermal energy allows increases in the potential energy at the level of 
individual molecules and atoms. This, like the stochastic gradient method, is a non- 
monotone optimization method. That is, the objective function value can sometimes 
increase, even though we wish to minimize this value. 

Simulated annealing is also a stochastic optimization process. However, there
are some differences from the stochastic gradient method. While the step lengths $s_k$ in
the stochastic gradient method are typically chosen with the property that
$\sum_{k=0}^{\infty} s_k = +\infty$ and $\sum_{k=0}^{\infty} s_k^2$ is finite, in simulated annealing, the corresponding
step lengths $s_k$ decrease much more slowly.

The starting point for simulated annealing is the Metropolis–Hastings algorithm
(Algorithm 71 of Section 7.4.2.3). Let $X$ be the state space we wish to
optimize over. We choose the function $q$ generating the probability distribution
$p(z) = q(z) / \sum_{x \in X} q(x)$ to be

$$q(x) = \exp(-\beta f(x))$$

where $f$ is the function we wish to minimize. Physically $\beta$ corresponds to $1/(k_B T)$
where $k_B$ is Boltzmann’s constant and $T$ the absolute temperature. The larger the
value of $\beta$ the more concentrated the distribution is near the global minimum; at the
other extreme, if $\beta = 0$ simulated annealing becomes essentially a random walk.

Algorithm 79 Simulated annealing. The input $U$ is a generator of independent uniformly
distributed values over (0, 1).
  generator simannealing($f$, $g$, sched, $x_0$, $n$, $U$)
    for $k = 0, 1, 2, \ldots, n-1$
      $\beta_k \leftarrow$ sched($k$)
      sample $x'$ from $g(x \mid x_k)$
      if next($U$) $< \min(1, \exp(-\beta_k (f(x') - f(x_k))))$
        $x_{k+1} \leftarrow x'$
      else
        $x_{k+1} \leftarrow x_k$
      end if
      yield $x_{k+1}$
    end for
  end generator

If we take $\beta = +\infty$, then the only steps of the Metropolis–Hastings algorithm
that are accepted are steps where the candidate iterate $x'$ satisfies $f(x') < f(x_k)$; this
is a randomized descent algorithm: pick a neighbor of $x_k$ at random. If the neighbor
has a smaller value of $f$, this neighbor becomes $x_{k+1}$. Otherwise $x_{k+1} = x_k$. Clearly,
this method will become stuck in local minima.

The task then is to choose a value of $\beta > 0$ that will give sufficient concentration
near the global minimizer. Choosing the value of $\beta$ too large will mean that it takes a
long time to leave a local minimizer; choosing the value of $\beta$ too small will result in the
algorithm simply “exploring” the state space $X$ without much regard for the objective
function. The ideal then is to start with a small value of $\beta$, but increase it slowly so
as to prevent the iterates from becoming stuck in a local minimizer. This corresponds
to “annealing” a physical system by slowly reducing the temperature. A complete
algorithm can be seen in Algorithm 79. Note that the factor $g(x' \mid x_k) / g(x_k \mid x')$ in the
Metropolis–Hastings algorithm (Algorithm 71) is typically not used in simulated
annealing, since accurately sampling from a given probability distribution is not
crucial for optimization and leads to greater computational complexity.

In most applications, the state space $X$ is the set of vertices of an undirected graph
$G$. The function $g(y \mid x)$ can then be taken to be $1/\deg(x)$ whenever $y$ is a neighbor
of $x$ and zero otherwise. That is, sampling $x'$ from $g(x \mid x_k)$ amounts to picking a
neighbor $x'$ of $x_k$ with equal probability. Note that in this case, if the degrees of the
nodes are unequal, then the asymptotic probability distribution for fixed $\beta$ is no longer
proportional to $\exp(-\beta f(x))$, but rather proportional to $\deg(x) \exp(-\beta f(x))$. The
factor of $\deg(x)$ is usually not particularly important in most applications.

Fig. 8.3.3 Simulated annealing estimated probability distribution for $f(x) = (x^2 - 1)^2$ for $\beta = 1$,
2, and 4

Figure 8.3.3 shows the estimated probability distributions obtained by simulated
annealing for $f(x) = (x^2 - 1)^2$ where $n$ is the number of steps of simulated annealing
used. The values of $\beta$ for the plots are 1, 2, and 4, respectively. At each step starting at
$x$, we choose one of $x \pm \delta$ each with probability 1/2. The value of $\delta$ used for generating
Figure 8.3.3 was $\delta = 0.1$. The number of steps used was $n = 10^4$, $10^5$, $10^6$, and $10^7$.
Since $f$ has two local minima, both of which are also global minimizers, the equilibrium
probability distributions are bi-modal. As $\beta$ increases, the probability distributions
become increasingly concentrated around $x = \pm 1$. Note that the estimated probability
distributions for $n = 10^7$ are very close to the actual equilibrium probability
distribution $p(x) = \exp(-\beta f(x)) / \int_{-\infty}^{\infty} \exp(-\beta f(y))\, dy$. However, for $n = 10^4$,
the estimated distribution is very far from the actual equilibrium distribution.
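
A small version of this experiment can be run with a Python generator in the spirit of Algorithm 79. The sketch below is an illustration under assumptions, not the code used for the figure: states are stored as integers $i$ representing $x = 0.1\, i$ (to avoid accumulating rounding error in $x \pm \delta$), the schedule is constant at $\beta = 4$, the run uses $10^5$ steps from $x_0 = 0$, and the acceptance test is written as "accept if $f$ does not increase, otherwise accept with probability $\exp(-\beta_k \Delta f)$", which is equivalent to comparing a uniform sample with $\min(1, \exp(-\beta_k \Delta f))$.

```python
import math
import random

def simannealing(f, neighbor, sched, x0, n, rng):
    """Generator form of simulated annealing: yields x_1, ..., x_n."""
    x = x0
    for k in range(n):
        beta = sched(k)
        x_new = neighbor(x, rng)
        delta = f(x_new) - f(x)
        # Same rule as next(U) < min(1, exp(-beta*delta)); testing delta <= 0 first
        # avoids calling exp with a large positive argument.
        if delta <= 0.0 or rng.random() < math.exp(-beta * delta):
            x = x_new
        yield x

step = 0.1
f = lambda i: ((step * i) ** 2 - 1.0) ** 2            # integer state i stands for x = 0.1*i
neighbor = lambda i, rng: i + rng.choice((-1, 1))     # x + delta or x - delta, equally likely
rng = random.Random(0)                                # seed chosen for reproducibility

samples = list(simannealing(f, neighbor, lambda k: 4.0, 0, 100000, rng))
near = sum(1 for i in samples if 5 <= abs(i) <= 15) / len(samples)
left = sum(1 for i in samples if i < 0) / len(samples)
print("fraction of samples with 0.5 <= |x| <= 1.5:", near)
print("fraction of samples with x < 0:", left)
```

The first fraction shows the concentration near $x = \pm 1$; the second gives a feel for how unevenly a run of this length can split between the two wells.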

As $n$ becomes larger, it is clear that the estimated probability distribution should
come closer to the actual equilibrium distribution. However, it is remarkable how large
$n$ has to be to obtain a close approximation to the actual equilibrium probability
distribution. There are two reasons for this. One is that where $f(x)$ varies slowly,
the simulated annealing process acts like a random walk. This means that in one
dimension, the distance traveled in $n$ steps is $O(\delta \sqrt{n})$. The second reason is that the
local maximum of $f(x)$ at $x = 0$ acts as a barrier between the two local minimizers.
For the largest value of $\beta = 4$, the barrier is the strongest, and for $n = 10^4$ the
estimated probability distribution is almost entirely on the left-hand side ($x < 0$).

How slowly should $\beta$ be increased in order to avoid becoming trapped? This is
shown in Laarhoven and Aarts [251] to be


(8.3.22) 2 

r 
where $\Gamma$ is the depth of the deepest local minimizer that is not the global minimizer.
In practice, this is too slow for most users as it will result in every x € X being 
visited infinitely many times for finite X. Most users use a faster “cooling schedule” 
than this but lose the guarantee that the global minimizer will be found. The point of 
using randomized search algorithms is precisely to avoid exhaustive search. 

For the common case where X is the set of vertices of a graph G, with neighbors 
chosen equally likely, the structure of G can have a great impact on the performance 
of stochastic search algorithms for optimization. Just how this structure affects the 
performance of simulated annealing and other stochastic search algorithms is, at time 
of writing, still only partly understood. 


Exercises. 


(1) Use gradient descent to minimize $f(x, y) = x^2 + xy^2 - 10xy + 4x + y^4 - 2y$
starting from $(x, y) = (0, 0)$ using Algorithm 73, first with no linesearch (the
linesearch function simply returns a constant value) using step length parameters
$s = 10^{-k}$, $k = 1, 2, 3$, and stopping the algorithm when $\|\nabla f(x_k)\|_2 < 10^{-6}$.
Record the number of function and gradient evaluations for each step length. Also
use gradient descent with the Armijo/backtracking algorithm (Algorithm 74)
for finding the step length, using $s_0 = 1$. Compare the number of function and
gradient evaluations for this method with the method without adaptive linesearch.

(2) Repeat the previous Exercise with the Rosenbrock function
$f(x, y) = (x - 1)^2 + 100 (y - x^2)^2$ starting at $(x, y) = (-1, 0)$.

(3) In this Exercise, we consider two simple modifications to the
Armijo/backtracking linesearch: (1) set the “start” step length $s_0$ equal to
the step length $s$ used in the previous step; (2) set the “start” step length $s_0$
equal to twice the step length $s$ used in the previous step. Compare the number
of function and gradient evaluations used in applying the modified methods
to minimizing the Rosenbrock function $f(x, y) = (x - 1)^2 + 100 (y - x^2)^2$
starting at $(x, y) = (-1, 0)$, stopping when $\|\nabla f(x_k)\|_2 < 10^{-3}$. Is either of
these modifications worthwhile? Why do you think so?

(4) Use gradient descent to minimize $f(x, y) = x^2 + xy^2 - 10xy + 4x + y^4 - 2y$
starting from $(x, y) = (0, 0)$ using Algorithm 73 with each of the three linesearch
algorithms outlined: Armijo/backtracking (Algorithm 74), the Goldstein algorithm
(Algorithm 75), and the Wolfe linesearch algorithm (Algorithms 76, 77).
Use the stopping criterion that $\|\nabla f(x_k)\|_2 < 10^{-2}$. Use reasonable or suggested
values for the parameters involved. Which method performs best?

(5) For $x^*$ an approximate minimizer of $f(x, y) = x^2 + xy^2 - 10xy + 4x + y^4 - 2y$
computed by one of the previous Exercises, compute the eigenvalues of
$\mathrm{Hess} f(x^*)$. Compare the rate of convergence you observed with the results of
Theorem 8.17. Note that Theorem 8.17 gives a bound for an exact linesearch
method.

(6) Separate the terms of the function $f(x, y) = x^2 + xy^2 - 10xy + 4x + y^4 - 2y$
into separate functions $\psi_1(x, y) = x^2$, $\psi_2(x, y) = xy^2$, etc., and set $\varphi(x, y, \xi) =
\psi_\xi(x, y)$ for $\xi = 1, 2, \ldots, 6$. We can then apply the stochastic gradient method
(Algorithm 78) to minimize $f(x, y)$, sampling $\xi$ from a uniform distribution
over $\{1, 2, \ldots, 6\}$. Do this first for fixed step lengths $s_k = s_0$ for all $k$ with
$s_0 = 10^{-j}$, $j = 1, 2, 3$. Stop when $\|\nabla f(x_k)\|_2 < 10^{-2}$. Next use Algorithm 78
with $s_k = 10^{-1}/(k + 1)^{2/3}$. Compare the rates of convergence of these methods
with each other and with deterministic gradient descent algorithms.

(7) Let $\phi(x) = -2/(1 + 2x^2) + x^2/10$, and let $f(x) = \frac12 (\phi(x - 1) + \phi(x + 1))$.
Use the stochastic gradient method (Algorithm 78) for minimizing $f$ with update
$x_{k+1} \leftarrow x_k - s_k\, \phi'(x_k - \xi_k)$ where $\xi_k$ is sampled from $\{\pm 1\}$ with each choice
equally likely. Compare using $s_k = 0.2$ for all $k$, and using $s_k = 0.5\, k^{-2/3}$. There
are two global minimizers of $f$: $x \approx \pm 0.9248753266450438$. Create a histogram
for the values $x_k$ with bins of width $10^{-2}$. Use $N = 10^7$ steps of the stochastic
gradient algorithm, and plot the resulting histograms against $x$; use a logarithmic
scale in the vertical axis. What is different about the two histograms you generate?
What does this imply about the ability of the stochastic gradient algorithm to
find the global minimizer of a function with multiple local minimizers?

(8) Apply the simulated annealing method (Algorithm 79) to minimizing the function
$f: \mathbb{R}^{20} \to \mathbb{R}$ from Exercise 4: for a point $x \in \mathbb{R}^{20}$ the neighbors of $x$ are the
points $x \pm \eta e_j$ where $e_j$ is the $j$th standard basis vector and $\eta$ = 10—. Start from
a randomly generated point, and run the method for $10^7$ steps using a constant
value of $\beta$. Use the values $\beta = \frac12, 1, 2, 4$. How close are the points generated
by simulated annealing to the global minimizer? [Note: It should be expected
that simulated annealing will give close to correct values for some components
of the vector $x_k$. How many components of $x_k$ are close to the minimizer of g?]

8.4 Second Derivatives and Newton’s Method


The standard first-order conditions for the unconstrained minimization of a function
$f: \mathbb{R}^n \to \mathbb{R}$ are $\nabla f(x) = 0$. This is a system of $n$ equations in $n$ unknowns. We can
apply the multivariate Newton method (Algorithm 43 in Section 3.3.4) and many of
its variations to the problem of solving $\nabla f(x) = 0$. This requires solving the linear
system $(\mathrm{Hess} f(x_k))\, d_k = -\nabla f(x_k)$ where $\mathrm{Hess} f(x)$ is the Hessian matrix of $f$ at
$x$.

Newton’s method has the advantage of rapid quadratic convergence when starting
close to the solution and the Hessian matrix at the solution is invertible. However,
the cost of each iteration can be substantial if the dimension $n$ is large. If $n$ is large,
then sparse solution methods (see Section 2.3) or iterative methods (see Section 2.4)
can be used to solve the linear systems that arise in Newton’s method.

Newton’s method in this form converges to solutions of $\nabla f(x) = 0$. This is a necessary
but not sufficient condition for a local minimizer except under special assumptions.
For example, if $f$ is smooth and convex, then $\nabla f(x) = 0$ is both a necessary
and sufficient condition for a global minimizer. In general, there are saddle points
(where $\mathrm{Hess} f(x)$ has both negative and positive eigenvalues) and local maximizers
(where $\mathrm{Hess} f(x)$ has only negative eigenvalues). Newton’s method regards saddle
points and local maximizers as equally valid solutions of $\nabla f(x) = 0$. Yet, for optimization
purposes, we wish to avoid these points. A consequence of this behavior of
Newton’s method is that the Newton step $d_k$ satisfying $\mathrm{Hess} f(x_k)\, d_k = -\nabla f(x_k)$
might not be a descent direction: $d_k^T \nabla f(x_k) = -\nabla f(x_k)^T (\mathrm{Hess} f(x_k))^{-1} \nabla f(x_k)$
could be positive for an indefinite Hessian matrix. This means that any line search
method based on the sufficient decrease criterion (8.3.4) can fail. This includes the
Armijo/backtracking, Goldstein, and Wolfe-based line search methods.

In order to accommodate Hessian matrices that are not positive definite, we need 
to modify the Newton method. We can do this by modifying the Hessian matrix used 
[190, Sec. 3.4]. We can also use a different globalization strategy than using line 
searches, such as trust region methods [190, Chap. 4]. 

Hessian modification strategies solve the equation $B_k d_k = -\nabla f(x_k)$ for $d_k$, where
$B_k = \operatorname{Hess} f(x_k) + E_k$ and $E_k$ is a symmetric matrix designed to make $B_k$ positive
definite. As long as $B_k$ is positive definite,

$$d_k^T \nabla f(x_k) = -d_k^T B_k d_k < 0,$$

provided $d_k \ne 0$. We can choose $E_k = \alpha I$ where $\alpha \ge 0$. More specifically, we set
$\alpha = 0$ if $\operatorname{Hess} f(x_k)$ is already positive definite. Alternatively, if $\lambda_{\min}(\operatorname{Hess} f(x_k)) \le 0$,
we can set $\alpha = -\lambda_{\min}(\operatorname{Hess} f(x_k)) + \gamma\, \|\operatorname{Hess} f(x_k)\|_2$, where $\gamma > 0$
ensures that the condition number $\kappa_2(B_k)$ is bounded above by $1 + 2/\gamma$, which is of order $1/\gamma$
for small $\gamma$.
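
The shift strategy above can be sketched in a few lines. This is only an illustrative sketch using NumPy: the function name `shifted_newton_direction`, the default `gamma = 1e-3`, and the test matrix are assumptions made for the example, and computing the smallest eigenvalue directly is practical only for small dense problems. It also assumes the Hessian is not the zero matrix.

```python
import numpy as np

def shifted_newton_direction(hess, grad, gamma=1e-3):
    """Solve (H + alpha*I) d = -grad with alpha chosen by the shift rule in the text."""
    lam_min = np.linalg.eigvalsh(hess)[0]    # eigvalsh returns eigenvalues in ascending order
    if lam_min > 0:
        alpha = 0.0                          # Hessian is already positive definite
    else:
        alpha = -lam_min + gamma * np.linalg.norm(hess, 2)
    B = hess + alpha * np.eye(hess.shape[0])
    return np.linalg.solve(B, -grad), alpha

# Illustrative data: an indefinite Hessian for which the pure Newton step points uphill.
H = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 2.0])
d_newton = np.linalg.solve(H, -g)            # unmodified Newton step
d_shifted, alpha = shifted_newton_direction(H, g)
print("Newton:  g^T d =", g @ d_newton)      # positive: not a descent direction
print("Shifted: g^T d =", g @ d_shifted)     # negative: a descent direction
```

A small `gamma` keeps the modified step close to the Newton step when the Hessian is nearly positive definite, at the price of a larger condition number for $B_k$.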

Another Hessian modification strategy is to modify the Cholesky factorization
algorithm to give the Cholesky factorization $L_k L_k^T = B_k + D_k$ where $D_k$ is a diagonal
matrix with non-negative diagonal entries. The diagonal entries of $D_k$ are computed
as needed to ensure that a factorization exists. However, we should not use a
diagonal entry that is too close to the minimum necessary to continue the factorization,
as this will result in excessively large numbers in following steps of the
factorization. To see why, consider

$$B = \begin{bmatrix} \beta & b^T \\ b & \widehat{B} \end{bmatrix}, \qquad
D = \begin{bmatrix} \delta & 0 \\ 0 & \widehat{D} \end{bmatrix}.$$

If $\beta < 0$ then we need $\delta > -\beta > 0$. Computing one step of the factorization of
$B + D$ gives

$$B + D = \begin{bmatrix} \sqrt{\delta + \beta} & 0 \\ b/\sqrt{\delta + \beta} & I \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & \widehat{B} + \widehat{D} - b b^T/(\delta + \beta) \end{bmatrix}
\begin{bmatrix} \sqrt{\delta + \beta} & b^T/\sqrt{\delta + \beta} \\ 0 & I \end{bmatrix}.$$

If $\delta + \beta \approx 0$ then $b/\sqrt{\delta + \beta}$ will be large, and subsequent entries in $D$ will also need
to be large to compensate for $-b b^T/(\delta + \beta)$. Since this can happen at many of the
following steps in the factorization, the resulting linear system will not give a useful
direction for searching. For more details of an algorithm that handles these issues by
ensuring that $\delta + \beta$ is sufficiently positive, see [190, Sec. 3.4].
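
To make the blow-up concrete, the following sketch evaluates the trailing entry left after one factorization step in the scalar $2 \times 2$ case, for several choices of the first diagonal shift. The function name and the numbers are hypothetical and chosen only for illustration; this is not the algorithm of [190, Sec. 3.4].

```python
def schur_complement_after_first_step(beta, b, b_hat, delta):
    """Scalar 2x2 case: trailing entry b_hat - b^2/(delta + beta) that remains
    after one factorization step, before any shift is added to it."""
    pivot = delta + beta
    if pivot <= 0.0:
        raise ValueError("delta must exceed -beta for the step to be possible")
    return b_hat - b * b / pivot

# Illustrative values: B = [[-1, 1], [1, 1]] has a negative first pivot.
beta, b, b_hat = -1.0, 1.0, 1.0
for delta in (2.0, 1.0 + 1e-4, 1.0 + 1e-8):
    s = schur_complement_after_first_step(beta, b, b_hat, delta)
    print(f"delta = {delta:.10f}   trailing entry = {s:.3e}")
```

As the shift approaches the minimum value needed, the trailing entry becomes hugely negative, so the next diagonal shift must be correspondingly large.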

An alternative approach is to use a trust region method (see [190, Chap. 4] or [54]).
At each step of a trust region method we have a quadratic model function
$m_k(d) := \tfrac12 d^T B_k d + \nabla f(x_k)^T d + f(x_k)$ where $B_k$ is either the Hessian matrix $\operatorname{Hess} f(x_k)$
or some approximation to it. Instead of minimizing $m_k(d)$ over all possible $d$ (which
gives the usual Newton step $d = -B_k^{-1} \nabla f(x_k)$), we minimize over a trust region

(8.4.1)  $\|d\| \le \Delta_k.$

Usually, the norm used for the trust region is the 2-norm, but other norms can be
used if they are computationally convenient. If we use the 2-norm, then the following
conditions, together with $B_k + \lambda_k I$ being positive semi-definite, are equivalent to
$d_k$ minimizing $m_k(d)$ over the trust region [190, Sec. 4.3]:

$$(B_k + \lambda_k I)\, d_k = -\nabla f(x_k),$$
$$\lambda_k \ge 0 \quad\text{and}\quad \Delta_k - \|d_k\|_2 \ge 0,$$
$$\lambda_k\, (\Delta_k - \|d_k\|_2) = 0.$$

The update step depends on how well the model represents the true objective function.
To make the comparison we compute

$$\rho_k = \frac{f(x_k) - f(x_k + d_k)}{m_k(0) - m_k(d_k)}.$$

If $\rho_k$ is positive and sufficiently large (say $\rho_k > \eta$ where $\tfrac14 > \eta \ge 0$) then we accept
the step: $x_{k+1} \leftarrow x_k + d_k$. Otherwise $x_{k+1} \leftarrow x_k$, which is a “null step”. If $\rho_k$ is too
small or negative (say $\rho_k < 1/4$), then the trust region should be shrunk (say
$\Delta_{k+1} \leftarrow \tfrac12 \min(\Delta_k, \|d_k\|)$). If $\rho_k$ is large enough (say $\rho_k > 3/4$) and $\|d_k\| = \Delta_k$ then we can
increase the size of the trust region up to a maximum size: $\Delta_{k+1} \leftarrow \min(2\Delta_k, \Delta_{\max})$.
If neither of the update formulas for the trust region size is used then $\Delta_{k+1} \leftarrow \Delta_k$.
Further details, including how to solve the trust region optimization problem, are
given in [190, Chap. 4].
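
The acceptance and radius rules just described can be collected into one small function. This is a sketch under stated assumptions: the name `trust_region_update`, the default acceptance threshold `eta = 0.1`, and the relative tolerance used to decide whether the step lies on the trust region boundary are all choices made for illustration.

```python
def trust_region_update(rho, step_norm, radius, radius_max, eta=0.1):
    """Return (accept_step, new_radius) from the ratio rho of actual to predicted decrease."""
    accept = rho > eta                       # otherwise take a "null step"
    if rho < 0.25:
        # poor agreement between model and function: shrink the region
        new_radius = 0.5 * min(radius, step_norm)
    elif rho > 0.75 and abs(step_norm - radius) <= 1e-12 * radius:
        # good agreement and the step hit the boundary: expand, up to the maximum
        new_radius = min(2.0 * radius, radius_max)
    else:
        new_radius = radius
    return accept, new_radius

print(trust_region_update(rho=0.9, step_norm=1.0, radius=1.0, radius_max=10.0))
print(trust_region_update(rho=-0.2, step_norm=0.5, radius=1.0, radius_max=10.0))
```

Note that a step can be accepted while the radius is still shrunk (for example, $\rho_k = 0.2$ with `eta = 0.1`), since the two decisions use different thresholds.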

The trust region method can be applied with $B_k = \operatorname{Hess} f(x_k)$ even where
$\operatorname{Hess} f(x_k)$ is not positive semi-definite. Trust region methods can also be applied if
$B_k$ is simply an approximation to $\operatorname{Hess} f(x_k)$, at the cost of degrading the rate of
convergence.

Exercises.


(1) Use the basic Newton method for minimizing $f(x, y) = x^2 + x y^2 - 10xy +
    4x + y^4 - 2y$ with the following modifications: (1) use Armijo backtracking line
    search, and (2) if $\operatorname{Hess} f(x_k)$ is not positive definite, then use $d_k = -\nabla f(x_k)$.
    Report the number of function, gradient, and Hessian evaluations.

(2) Show how requiring the sufficient decrease criterion for $\|\nabla f(x)\|_2^2$ will prevent
    convergence to a local minimizer if $x_k$ is close to a critical point $\widehat{x}$ where
    $\operatorname{Hess} f(\widehat{x})$ has negative eigenvalues.

(3) For nonlinear least squares problems with $f(w) = N^{-1} \sum_{i=1}^N (y_i - g_i(w))^2$ the
    Gauss–Newton method just uses 1st derivatives of the $g_i$ functions. First show
    that

    $$\operatorname{Hess} f(w) = 2N^{-1} \left[ \sum_{i=1}^N \nabla g_i(w) \nabla g_i(w)^T
        + \sum_{i=1}^N (g_i(w) - y_i) \operatorname{Hess} g_i(w) \right].$$

    The basic Gauss–Newton update is

    $$d_k \leftarrow -\left[ \sum_{i=1}^N \nabla g_i(w_k) \nabla g_i(w_k)^T \right]^{-1}
        \sum_{i=1}^N (g_i(w_k) - y_i) \nabla g_i(w_k),$$
    $$s_k \leftarrow \mathrm{linesearch}(w_k, d_k, \ldots),$$
    $$w_{k+1} \leftarrow w_k + s_k d_k,$$

    where the step length parameter $s_k$ is either one, or determined by a line search
    method with starting value $s = 1$.

    (a) Show that $d_k$ is always a descent direction provided the vectors
        $\{\nabla g_i(w_k) \mid i = 1, 2, \ldots, N\}$ are linearly independent.

    (b) While the rate of convergence is not expected to be superlinear, explain why
        convergence is rapid if $g_i(w_k) \approx y_i$ for all $i$.

    (c) For most statistical fitting problems, the Gauss–Newton method converges
        quickly for large $N$. Suppose that $g_i(w) = g(w, x_i)$ where the data points
        $(x_i, y_i)$, $i = 1, 2, \ldots, N$, are sampled independently from the same probability
        distribution. If $(X, Y)$ are random variables with the same probability
        distribution, use the Law of Large Numbers to show that

        $$N^{-1} \sum_{i=1}^N \nabla g_i(w) \nabla g_i(w)^T \to E\left[\nabla_w g(w, X) \nabla_w g(w, X)^T\right],$$
        $$N^{-1} \sum_{i=1}^N (g_i(w) - y_i) \operatorname{Hess} g_i(w) \to E\left[(g(w, X) - Y) \operatorname{Hess}_w g(w, X)\right], \text{ and}$$
        $$N^{-1} \sum_{i=1}^N (g_i(w) - y_i) \nabla g_i(w) \to E\left[(g(w, X) - Y) \nabla_w g(w, X)\right].$$

        Note that if $\nabla_w g(w, X)$ and $\operatorname{Hess}_w g(w, X)$ are close to constant, then
        $E[g(w, X) - Y] \approx 0$ and so $\operatorname{Hess} f(w) \approx 2\,E\left[\nabla_w g(w, X) \nabla_w g(w, X)^T\right]$ if
        $w$ is close to the minimizer $w^*$.

    (d) Show that the Gauss–Newton method can be implemented by performing a
        QR factorization of $J(w) := [\nabla g_1(w), \nabla g_2(w), \ldots, \nabla g_N(w)]$.

(4) Implement a method for solving the problem $\min_x g^T x + \tfrac12 x^T B x$ subject
    to $\|x\|_2 \le \Delta$ by solving the conditions $(B + \lambda I)x = -g$ with $\lambda \ge 0$, $\lambda > 0$
    implies that $\|x\|_2 = \Delta$, and $B + \lambda I$ is positive semi-definite. We can do this
    by solving $\gamma(\lambda) = 0$ where $\gamma(\lambda) = \|(B + \lambda I)^{-1} g\|_2 - \Delta$ if $B + \lambda I$ is positive
    definite and $+\infty$ otherwise. If $B$ is positive definite and $\|B^{-1} g\|_2 \le \Delta$
    then we set $\lambda = 0$. Otherwise, find $\lambda_{\max}$ where we can guarantee that
    $\|(B + \lambda_{\max} I)^{-1} g\|_2 \le \Delta$ and perform the bisection method for solving $\gamma(\lambda) = 0$.
    Test this on some randomly generated symmetric matrices $B$ and randomly
    generated vectors $g$. Make sure that the desired conditions are satisfied by the
    computed solution $x$.

(5) Nesterov and Nemirovskii [187, Chap. 2] outline a theory of self-concordant
    functions: a convex function $f\colon \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is $M$-self-concordant if for
    any $x$ and $d$,

    $$\frac{\left|(d^3/ds^3)\, f(x + sd)\right|}{\left[(d^2/ds^2)\, f(x + sd)\right]^{3/2}} \le 2M^{-1/2}
        \quad\text{provided } f(x + sd) < +\infty.$$

    Nesterov and Nemirovskii showed that if $f$ is $M$-self-concordant, then applying
    a guarded Newton method with the sufficient decrease criterion, the number of
    steps needed to achieve $\|\nabla f(x_k)\|_2 < \eta$ is $O(1 + M^{-1}(f(x_0) - \min_x f(x)) +
    \log\log(1/\eta))$ as $\eta \downarrow 0$. The hidden constants do not depend on $f$, $x_0$, $M$, or $\eta$.

    (a) Show that if $f$ is $M$-self-concordant, then $g(x) = f(Ax + b)$ is also $M$-self-concordant
        with the same $M$.

    (b) Show that if $f_1$ is $M_1$-self-concordant and $f_2$ is $M_2$-self-concordant, then
        $f_1 + f_2$ is $M$-self-concordant for $M = \min(M_1, M_2)$.

    (c) Show that if $f$ is $M$-self-concordant, then $\alpha f$ is $(\alpha M)$-self-concordant for
        $\alpha > 0$.

    (d) Show that $f(u) = -\ln u$ is 1-self-concordant over $u > 0$.


(6) Implement the modified Cholesky factorization algorithm outlined in [190,
    Sec. 3.4] to compute $B + D = L L^T$ where $B$ is the given matrix and $D$ is
    diagonal with non-negative diagonal entries. Test it on randomly generated symmetric
    matrices, and on matrices $B = X^T X$ with $X$ randomly generated. Is it
    consistent with the standard Cholesky factorization for positive-definite matrices?
    For matrices $B$ that are not positive definite, compare the largest diagonal
    entry of $D$ with the size of the largest negative eigenvalue of $B$.

(7) The dog-leg trust region method gives an approximate solution to the trust
    region problem $\min_p g^T p + \tfrac12 p^T B p$ subject to $\|p\|_2 \le \Delta$: Let $p_1 = -sg$
    where $0 \le s \le \Delta$ is chosen to minimize $g^T p + \tfrac12 p^T B p$ with $p = -sg$. Let
    $p_2 = -B^{-1} g$ if $B$ is positive definite, and $p_2 = p_1$ otherwise. We choose
    the approximate solution to be the point $p = (1 - \theta) p_1 + \theta p_2$ that minimizes
    $g^T p + \tfrac12 p^T B p$ over $0 \le \theta \le 1$ subject to $\|p\|_2 \le \Delta$. Implement this method.
    If you wish, incorporate this into a trust region algorithm and test the complete
    method on the Rosenbrock function. How does the complete method perform
    compared to other Newton-type methods?

(8) A common criticism of Newton-based methods is that they are expensive to
    implement for large-scale problems as computing the Hessian matrix $\operatorname{Hess} f(x_k)$
    is already very expensive in time and memory compared to computing function
    values and gradients. For $f(x) = (x^T x)^2$ with $x \in \mathbb{R}^n$ show that $\operatorname{Hess} f(x) =
    8 x x^T + 4 (x^T x) I$, which requires $O(n^2)$ floating point operations. On the other
    hand, show that for any vector $u$, $\operatorname{Hess} f(x)\, u$ can be computed in $O(n)$ floating
    point operations.

(9) △ For large-scale optimization problems, keeping the critique of the previous
    Exercise in mind, first give arguments that the cost of computing $\operatorname{Hess} f(x)\, u$ is at
    most proportional to the cost of computing $f(x)$ using automatic differentiation.
    If $\operatorname{Hess} f(x)$ is always positive definite, then the conjugate gradient method can
    be used to compute the solution of the Newton equation $\operatorname{Hess} f(x)\, d = -\nabla f(x)$.
    Otherwise, describe how to use the Lanczos iteration to implement a trust region
    method.


8.5 Conjugate Gradient and Quasi-Newton Methods 


Conjugate gradient methods for solving $Bd = -g$ with $B$ symmetric positive definite
were developed in Sections 2.4.2 and 2.4.2.1. Part of the derivation relied on the fact
that if $B$ is symmetric positive definite then solving $Bd = -g$ is equivalent to finding
the minimum of $\phi(d) := \tfrac12 d^T B d + g^T d + c$. Conjugate gradient methods can be
generalized from convex quadratic functions to general smooth functions. However,
in generalizing the method, some of the properties that could be proven for solving
linear systems no longer hold for general optimization problems. This further means
that some expressions that are equivalent in the context of solving a linear system of
equations are no longer equivalent in the context of general optimization.

Quasi-Newton methods build a sequence of positive-definite approximations
$B_k \approx \operatorname{Hess} f(x_k)$ by means of rank-1 or rank-2 updates using the changes in the
gradient $\nabla f(x_k) - \nabla f(x_{k-1})$ and position $x_k - x_{k-1}$.


8.5.1 Conjugate Gradients for Optimization 


It is noted in Section 2.4.2.1 that part of the conjugate gradient method can be
derived from the condition that $x_{k+1}$ minimizes the convex quadratic function
$\phi(x) = \tfrac12 x^T A x - b^T x$ over $x \in \operatorname{span}\{p_0, p_1, \ldots, p_k\}$, together with the property
that $p_i^T A p_j = 0$ for $i \ne j$ (the conjugacy property). The conjugacy condition implies
that $x_{k+1} - x_k \in \operatorname{span}\{p_k\}$, so that we can find $x_{k+1} = x_k + s_k p_k$ by using an exact
line search.

Algorithm 80 Conjugate gradients algorithm, version 2
 1  function conjgrad2($A$, $b$, $x_0$, $\epsilon$)
 2    $k \leftarrow 0$; $r_0 \leftarrow A x_0 - b$; $p_0 \leftarrow -r_0$
 3    while $\|r_k\|_2 > \epsilon$
 4      $q_k \leftarrow A p_k$
 5      $\alpha_k \leftarrow r_k^T r_k / p_k^T q_k$
 6      $x_{k+1} \leftarrow x_k + \alpha_k p_k$
 7      $r_{k+1} \leftarrow r_k + \alpha_k q_k$
 8      $\beta_k \leftarrow r_{k+1}^T r_{k+1} / r_k^T r_k$
 9      $p_{k+1} \leftarrow -r_{k+1} + \beta_k p_k$
10      $k \leftarrow k + 1$
11    end while
12    return $x_k$
13  end function

We can generalize this algorithm to functions $f(x)$ that are neither quadratic
nor convex, and to use inexact line search methods. However, in this process, we
lose some properties of the method. For convex quadratic objective functions and
exact line searches, we have the following properties of the iterates of the conjugate
gradient method (recall that $r_j = A x_j - b$):

$$r_i^T r_j = 0 \quad\text{if } i \ne j,$$
$$r_i^T p_j = 0 \quad\text{if } j < i,$$
$$p_i^T A p_j = 0 \quad\text{if } i \ne j,$$
$$\operatorname{span}\{p_0, \ldots, p_k\} = \operatorname{span}\{r_0, \ldots, r_k\}
    = \operatorname{span}\{r_0, A r_0, \ldots, A^k r_0\} \quad\text{for all } k \ge 0.$$


This is the main content of Theorem 2.20. 

In order to avoid flipping pages to compare with the original algorithm, we repeat 
the unpreconditioned standard linear conjugate gradient algorithm (Algorithm 27) 
as conjgrad2. 

From this, we will see how to derive the conjugate gradient algorithm for optimization.
First, if $f(x) = \tfrac12 x^T A x - b^T x + c$, then the residual $r = Ax - b = \nabla f(x)$.
So we replace the computation of the residual $r_{k+1}$ on line 7 with $r_{k+1} \leftarrow \nabla f(x_{k+1})$. The
other thing to remember is that in the general situation, there is no matrix $A$. The
closest thing we have to $A$ for a general smooth function $f$ is $\operatorname{Hess} f(x)$, the Hessian
matrix of $f$ at $x$. But in general, $\operatorname{Hess} f(x)$ is neither constant nor easily computable.
So we should avoid any explicit reference to $A$. These references occur on line 4 in
computing $q_k$, which is used to compute the step length $\alpha_k$ on line 5 and the update
of the residual on line 7. For the step length, we simply use a suitable line search
algorithm. This gives Algorithm 81 (conjgradoptFR) below, which is known as the
Fletcher–Reeves conjugate gradient algorithm.

Algorithm 81 Conjugate gradients algorithm for optimization, version 1
 1  function conjgradoptFR($f$, $\nabla f$, $x_0$, linesearch, params, $\epsilon$)
 2    $k \leftarrow 0$; $r_0 \leftarrow \nabla f(x_0)$; $p_0 \leftarrow -r_0$
 3    while $\|r_k\|_2 > \epsilon$
 5      $s_k \leftarrow$ linesearch($f$, $\nabla f$, $x_k$, $p_k$, params)
 6      $x_{k+1} \leftarrow x_k + s_k p_k$
 7      $r_{k+1} \leftarrow \nabla f(x_{k+1})$
 8      $\beta_k \leftarrow r_{k+1}^T r_{k+1} / r_k^T r_k$
 9      $p_{k+1} \leftarrow -r_{k+1} + \beta_k p_k$
10      $k \leftarrow k + 1$
11    end while
12    return $x_k$
13  end function
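
A minimal Python sketch of the Fletcher–Reeves iteration follows. It is an illustration rather than a reference implementation: the line search is passed in as a callable `linesearch(x, p)`, the iteration cap `max_iter` is a safeguard added here, and the usage example employs an exact line search on a convex quadratic (where the method reduces to linear conjugate gradients) only because that makes the result easy to check.

```python
import numpy as np

def conjgrad_opt_fr(grad, x0, linesearch, eps=1e-10, max_iter=1000):
    """Fletcher-Reeves conjugate gradients; linesearch(x, p) returns a step length."""
    x = np.asarray(x0, dtype=float)
    r = grad(x)                              # r_k plays the role of the residual
    p = -r
    for _ in range(max_iter):
        if np.linalg.norm(r) <= eps:
            break
        s = linesearch(x, p)
        x = x + s * p
        r_new = grad(x)
        beta = (r_new @ r_new) / (r @ r)     # Fletcher-Reeves formula
        p = -r_new + beta * p
        r = r_new
    return x

# Illustrative test problem: f(x) = 0.5 x^T A x - b^T x, so grad f(x) = A x - b.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: A @ x - b
exact = lambda x, p: -(p @ grad(x)) / (p @ A @ p)   # exact minimizer along p for a quadratic
x_min = conjgrad_opt_fr(grad, np.zeros(2), exact)
print(x_min, np.linalg.solve(A, b))
```

For a general smooth function the `exact` callable would be replaced by an inexact line search; which line searches are suitable is the subject of the next subsection.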


8.5.1.1 Line Search Algorithms for Conjugate Gradient Optimization 
Algorithms 


Now we need to ask: what makes for a suitable line search algorithm for this generalized
conjugate gradient algorithm? All of the line search algorithms we have
discussed assume the direction of the line search (here $p_k$) must be a descent
direction: $p_k^T \nabla f(x_k) = p_k^T r_k < 0$. This is clearly true for $k = 0$ as $p_0 = -r_0$.
But can we guarantee this will be true for $k = 1, 2, \ldots$? If we use exact line
searches, then $s_k$ minimizes $f(x_k + s p_k)$ over all $s > 0$, and so $p_k^T \nabla f(x_k +
s_k p_k) = p_k^T \nabla f(x_{k+1}) = p_k^T r_{k+1} = 0$. From line 9, we then have

$$p_{k+1}^T r_{k+1} = (-r_{k+1} + \beta_k p_k)^T r_{k+1} = -r_{k+1}^T r_{k+1} < 0,$$

provided $r_{k+1} \ne 0$, as we wanted. But in practice, we can only approximate exact
line searches, and attempting something close to an exact line search can be very
expensive in terms of function evaluations.

Since the descent direction condition involves gradients, we need a line search
method that involves $\nabla f(x)$. This leaves Wolfe condition (8.3.7, 8.3.8) based line
searches as the only practical way to properly implement conjugate gradient methods
for optimization. As you may recall, we have two parameters for the Wolfe conditions:
$c_1$ for the sufficient decrease criterion, and $c_2$ for the curvature condition. In order
to guarantee the existence of a step satisfying (8.3.7, 8.3.8) we need $f$ bounded
below as well as being smooth, and $0 < c_1 < c_2 < 1$. This can approximate exact
line searches by making $c_2 > 0$ small.
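
As a reminder of what such a line search has to check, here is a small sketch that tests a trial step against the Wolfe conditions. It assumes that (8.3.7, 8.3.8) are the standard sufficient decrease and curvature conditions, which are not restated in this section; the function name, the default parameter values, and the optional `strong` variant of the curvature test are choices made here for illustration.

```python
def satisfies_wolfe(phi0, dphi0, phi_s, dphi_s, s, c1=1e-4, c2=0.9, strong=False):
    """Check a step s for phi(s) = f(x + s p), where dphi denotes the derivative
    phi'(s) = p^T grad f(x + s p). phi0, dphi0 are values at 0; phi_s, dphi_s at s."""
    sufficient_decrease = phi_s <= phi0 + c1 * s * dphi0
    if strong:
        curvature = abs(dphi_s) <= c2 * abs(dphi0)
    else:
        curvature = dphi_s >= c2 * dphi0
    return sufficient_decrease and curvature

# Illustrative one-dimensional model: phi(s) = (s - 1)^2, minimized at s = 1.
phi = lambda s: (s - 1.0) ** 2
dphi = lambda s: 2.0 * (s - 1.0)
for s in (0.01, 1.0, 2.0):
    print(s, satisfies_wolfe(phi(0.0), dphi(0.0), phi(s), dphi(s), s))
```

The very short step fails the curvature condition, and the overly long step fails sufficient decrease; making `c2` smaller narrows the accepted steps toward the exact minimizer.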

If exact line searches are used so that $p_k^T \nabla f(x_k + s_k p_k) = 0$, then we can guarantee
that $p_{k+1}$ is a descent direction. If we use a Wolfe condition-based line search,
what value of $c_2 > 0$ can guarantee the same property? For the Fletcher–Reeves conjugate
gradient method, it turns out that $0 < c_2 < \tfrac12$ is sufficient to ensure that $p_{k+1}$
is a descent direction. The reason for this is the following Lemma [190, Lemma 5.6,
p. 125]:


Lemma 8.18 The Fletcher–Reeves conjugate gradient algorithm with Wolfe
condition-based line search with $0 < c_2 < \tfrac12$ satisfies the condition

$$-\frac{1}{1 - c_2} \le \frac{\nabla f(x_k)^T p_k}{\|\nabla f(x_k)\|_2^2} \le \frac{2c_2 - 1}{1 - c_2}$$

for all $k$.


A proof is given in [190, pp. 125-126].

Lemma 8.18 implies that $\nabla f(x_k)^T p_k < 0$ for all $k$ and so $p_k$ is a descent
direction for all $k$. However, it does not imply that $p_k$ is always a good descent
direction. Zoutendijk’s theorem (Theorem 8.16) shows good global performance
if $-\nabla f(x_k)^T p_k / (\|\nabla f(x_k)\|_2 \|p_k\|_2)$ is bounded away from zero. The guarantee
of Lemma 8.18 is that $-\nabla f(x_k)^T p_k / \|\nabla f(x_k)\|_2^2$ is bounded away from zero. The
Fletcher–Reeves method can run into trouble if $\|p_k\|_2 / \|\nabla f(x_k)\|_2 \gg 1$. Then $p_k$ will
be close to orthogonal to $-\nabla f(x_k)$, and the line search will then be fairly short. This
results in $x_{k+1} \approx x_k$, so $\nabla f(x_{k+1}) \approx \nabla f(x_k)$. The Fletcher–Reeves formula then
gives $\beta_k = \|\nabla f(x_{k+1})\|_2^2 / \|\nabla f(x_k)\|_2^2 \approx 1$ and $p_{k+1} = -\nabla f(x_{k+1}) + \beta_k p_k \approx p_k$,
so that $\|p_{k+1}\|_2 / \|\nabla f(x_{k+1})\|_2 \gg 1$ and the problem continues.

This effect can be seen in Figure 8.5.1, where the spiral shows the behavior of
the Fletcher–Reeves conjugate gradient method where this happens. The objective
function for this application of the method is $f(x, y) = y^2 - x^2 y - 3y + x^4 - x^3 +
2xy - x$, which has a global minimizer near $(-1.457, 4.019)$. Wolfe condition-based
line searching was used with c; = 10-7 and cy = 0.2. On the other hand, Figure 8.5.1 
shows much better behavior of the Polak—Ribiére conjugate gradient method (8.5.1), 
which goes much more directly to the minimizer. 

In spite of this possible poor performance of the Fletcher-Reeves method, it has 
been proven to be globally convergent under mild conditions [190, Thm. 5.7]. 


8.5.2 Variants on the Conjugate Gradient Method 


The formula $\beta_k^{FR} = r_{k+1}^T r_{k+1} / r_k^T r_k$ in the Fletcher–Reeves conjugate gradient
algorithm could be changed to give the same behavior in the ideal case of convex
quadratic objective functions and exact line searches, but hopefully better behavior
for more general objective functions and inexact line searches.

One of these is the Polak–Ribière method, which uses the formula

(8.5.1)  $\beta_k^{PR} = (r_{k+1} - r_k)^T r_{k+1} / r_k^T r_k.$

Fig. 8.5.1 Fletcher–Reeves vs Polak–Ribière conjugate gradient methods


The Polak–Ribière formula gives identical results to the Fletcher–Reeves formula
in the ideal case since for these problems, $r_{k+1}^T r_k = 0$. Other formulas equivalent
in the context of the ideal case include the Hestenes–Stiefel formula, which was
used in the original paper on conjugate gradient methods [122]:
$\beta_k^{HS} = r_{k+1}^T (r_{k+1} - r_k) / (r_{k+1} - r_k)^T p_k$.

Of these alternatives, the Polak–Ribière method is perhaps the most popular.
However, even if exact line searches are used, it is possible for this method to
fail to converge to a critical point [208]. Part of the issue with this example
is that the value of $\beta_k$ alternates in sign. Since in the ideal case, $\beta_k > 0$ for
all $k$, another modification that works well is using the Polak–Ribière-plus method:
$\beta_k^{PR+} = \max\left((r_{k+1} - r_k)^T r_{k+1} / r_k^T r_k,\ 0\right)$, which can be shown to be globally
convergent.

In the situation where the Fletcher–Reeves method has difficulty ($\|p_k\|_2 / \|\nabla f(x_k)\|_2 =
\|p_k\|_2 / \|r_k\|_2 \gg 1$), the Polak–Ribière and Polak–Ribière-plus methods do
not have difficulty: if $r_{k+1} = \nabla f(x_{k+1}) \approx \nabla f(x_k) = r_k$ then
$\beta_k^{PR} = (r_{k+1} - r_k)^T r_{k+1} / r_k^T r_k \approx 0$ and $p_{k+1} = -r_{k+1} + \beta_k^{PR} p_k \approx -r_{k+1}$, and the method “resets”
to a gradient descent algorithm. Since $\beta_k^{PR+} = \max(\beta_k^{PR}, 0)$, we also get $\beta_k^{PR+} \approx 0$,
which also leads to a kind of “reset” of the Polak–Ribière-plus method. This is
illustrated in Figure 8.5.1.
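
The formulas of this subsection are easy to compare numerically. The sketch below implements them for plain Python lists; the function names and the sample vectors are hypothetical. The first data set mimics the ideal case (successive gradients orthogonal, and the new gradient orthogonal to the old direction), and the second mimics a stalled Fletcher–Reeves step where the gradient has barely changed.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def beta_fr(r_new, r_old):
    """Fletcher-Reeves."""
    return dot(r_new, r_new) / dot(r_old, r_old)

def beta_pr(r_new, r_old):
    """Polak-Ribiere, formula (8.5.1)."""
    diff = [a - b for a, b in zip(r_new, r_old)]
    return dot(diff, r_new) / dot(r_old, r_old)

def beta_pr_plus(r_new, r_old):
    """Polak-Ribiere-plus: negative values are replaced by zero."""
    return max(beta_pr(r_new, r_old), 0.0)

def beta_hs(r_new, r_old, p_old):
    """Hestenes-Stiefel."""
    diff = [a - b for a, b in zip(r_new, r_old)]
    return dot(r_new, diff) / dot(diff, p_old)

# Ideal case: r_new orthogonal to r_old and to p_old = -r_old; all formulas agree.
r_old, r_new, p_old = [1.0, 0.0], [0.0, 2.0], [-1.0, 0.0]
print(beta_fr(r_new, r_old), beta_pr(r_new, r_old), beta_hs(r_new, r_old, p_old))

# Stalled case: the gradient has hardly changed between iterates.
r_old, r_new = [1.0, 1.0], [1.0, 1.001]
print(beta_fr(r_new, r_old), beta_pr(r_new, r_old), beta_pr_plus(r_new, r_old))
```

In the stalled case the Fletcher–Reeves value stays near one while the Polak–Ribière value is near zero, which is the “reset” described above.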


8.5.3 Quasi-Newton Methods 


Quasi-Newton methods build an estimate of the Hessian matrix from the sequence
of computed gradients and steps in order to obtain superlinear convergence, similar
to Newton’s method. These methods are generally considered superior to conjugate
gradient methods as quasi-Newton methods do not rely as heavily on the line search.

The first quasi-Newton method was developed by the physicist William Davidon*, who
was in search of a better method than co-ordinate descent for some energy calculations.
The method, now known as the Davidon–Fletcher–Powell (DFP) method,
starts with a default initial approximation $B_0$ to the Hessian matrix, and after each
step updates the estimate $B_k$ as follows:


(8.5.2)  $B_{k+1} = (I - \rho_k y_k s_k^T)\, B_k\, (I - \rho_k s_k y_k^T) + \rho_k y_k y_k^T,$ where
(8.5.3)  $\rho_k = 1/(y_k^T s_k),$
(8.5.4)  $s_k = x_{k+1} - x_k,$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k).$


The formula for $B_{k+1}$ is designed to satisfy $B_{k+1} s_k = y_k$.

The standard approach is to perform a line search in the direction $d_k =
-B_k^{-1} \nabla f(x_k)$, starting with a step length of one as for the Newton method. As
for Newton methods, this direction is guaranteed to be a descent direction
if $B_k$ is positive definite and symmetric. In order to guarantee that $B_{k+1}$ is also
symmetric positive definite, we need $s_k^T y_k > 0$, which does not hold in general for
non-convex $f$. To ensure this, Wolfe condition-based line searches are typically used.
The curvature condition (8.3.9) implies

$$\nabla f(x_{k+1})^T (x_{k+1} - x_k) \ge c_2\, \nabla f(x_k)^T (x_{k+1} - x_k), \quad\text{so}$$

$$y_k^T s_k \ge (c_2 - 1)\, \nabla f(x_k)^T s_k > 0$$

provided $0 < c_2 < 1$, since $s_k$ is a positive multiple of the descent direction $d_k$. Thus Wolfe
condition-based line searches can ensure that the
$B_k$ remain symmetric positive definite provided $B_0$ is.

The DFP method can be made more efficient by introducing an update formula
for $H_k = B_k^{-1}$:

(8.5.5)  $H_{k+1} = H_k - \dfrac{H_k y_k y_k^T H_k}{y_k^T H_k y_k} + \rho_k s_k s_k^T.$

This update formula can be deduced through the Sherman–Morrison formula (2.1.16)
applied twice. Note that $H_{k+1} y_k = s_k$.

There is a symmetry between $B_k$ and $H_k$, and between $s_k$ and $y_k$. Swapping both
$B_k$ and $H_k$, and $s_k$ and $y_k$, gives new formulas for $B_{k+1}$ and $H_{k+1}$ satisfying the
equations $B_{k+1} s_k = y_k$ and $H_{k+1} y_k = s_k$. This gives new update formulas that in
fact perform better than Davidon’s original choice. These update formulas are the
Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton update formulas:

(8.5.6)  $B_{k+1} = B_k - \dfrac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \rho_k y_k y_k^T,$
(8.5.7)  $H_{k+1} = (I - \rho_k s_k y_k^T)\, H_k\, (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T.$


* Davidon is perhaps best known for participating in the March 8, 1971, FBI office break-in in
Media, Pennsylvania, and for releasing the documents obtained to the press [176].



Both the DFP and BFGS methods give superlinear convergence to a local minimizer
with a positive-definite Hessian matrix, but the BFGS method seems to be more
robust.
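
The inverse updates (8.5.5) and (8.5.7) are short enough to write out directly. The sketch below uses NumPy; the function names and sample vectors are illustrative assumptions, and the DFP version assumes $H_k$ is symmetric. The usage lines check the secant equation $H_{k+1} y_k = s_k$ and positive definiteness for data with $y_k^T s_k > 0$.

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """BFGS update (8.5.7) of the inverse Hessian approximation H."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)

def dfp_inverse_update(H, s, y):
    """DFP update (8.5.5) of the inverse Hessian approximation H (H symmetric)."""
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)

# Illustrative data with y^T s > 0, as a Wolfe line search would ensure.
H0 = np.eye(2)
s = np.array([1.0, 0.5])
y = np.array([2.0, 0.25])
for update in (bfgs_inverse_update, dfp_inverse_update):
    H1 = update(H0, s, y)
    print(update.__name__, np.allclose(H1 @ y, s), np.linalg.eigvalsh(H1).min() > 0)
```

Working with $H_k$ avoids solving a linear system at each step: the search direction is just the matrix-vector product $d_k = -H_k \nabla f(x_k)$.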

There are other update formulas that can be used. In particular there is the Broyden
family of updates:

(8.5.8)  $B_{k+1} = (1 - \phi_k)\, B_{k+1}^{BFGS} + \phi_k\, B_{k+1}^{DFP},$

with $0 \le \phi_k \le 1$. Global convergence results for the BFGS method can be
proven assuming that there are bounds $\mu_{\max} \ge \lambda_{\max}(\operatorname{Hess} f(x))$ and $0 < \mu_{\min} \le
\lambda_{\min}(\operatorname{Hess} f(x))$ on the maximum and minimum eigenvalues of the Hessian matrices
of $f$ [190, Thm. 6.5]. The key to the proof is showing that $\operatorname{trace}(B_k) - \ln \det B_k$
cannot increase “too quickly” as $k \to \infty$. This means that neither $B_k$ nor $H_k = B_k^{-1}$
can become large “too quickly”.


Exercises. 


(1) Implement the Polak–Ribière-plus conjugate gradient method with
    Wolfe condition-based line search, and apply it to the function
    $f(x, y) = x^2 + x y^2 - 10xy + 4x + y^4 - 2y$ with initial point $x_0 =
    (x_0, y_0) = (0, 0)$. Use default values for $c_1$ and $c_2$ in the Wolfe
    conditions.

(2) M.J.D. Powell stated in [207] that even for convex quadratic objective functions
    and exact line searches, the Fletcher–Reeves conjugate gradient method has only
    linear convergence and fails to have $n$-step exact convergence if $p_0$ is not in
    the direction of $-r_0 = -\nabla f(x_0)$. Implement this method for $f(x) = \tfrac12 x^T x$,
    $x \in \mathbb{R}^n$, but modified so that you can control $p_0$ independently of $x_0$. Report the
    angle between $p_k$ and $-r_k = -\nabla f(x_k)$ as well as the sequence of values $f(x_k)$
    for all $k$.


(3) The conjugate gradient method is suitable for many large-scale problems as
    its memory requirements are small, and only function and gradient evaluations
    are required. However, it relies on a line search to approximately minimize
    $f(x_k + s p_k)$ over $s > 0$, and Wolfe condition-based line searches or similar must
    be used to ensure that the method converges. This is a particularly important issue
    in machine learning where even one pass over the data is expensive. An alternative
    is to pick $s_0 > 0$, compute $f(x_k + s_0 p_k)$, and compute the quadratic interpolant
    $\phi$ of the data $f(x_k)$, $p_k^T \nabla f(x_k)$, and $f(x_k + s_0 p_k)$ for $\phi(0)$, $\phi'(0)$, and $\phi(s_0)$. Set
    $s_k^*$ to be the minimizer of $\phi$. Accept $s_k = s_k^*$ if the Wolfe conditions are satisfied
    at $s = s_k^*$, and otherwise perform the Wolfe condition-based line search. Also
    set $s_0$ for step $k$ to be $2 s_{k-1}$ where $s_{k-1}$ is the accepted step size for step $k - 1$.
    Implement this method and compare with the default line search method used.

(4) Re-create Figure 8.5.1 using the data provided in Section 8.5.1.1. Try varying
    the starting point for the algorithms to see if the behavior of the methods is robust
    to these changes.

(5) △ Suppose that the BFGS quasi-Newton method (see (8.5.6, 8.5.7)) is applied
    to $f(x) = \tfrac12 x^T A x + b^T x$ with $A$ symmetric positive definite using exact line
    searches. Show that if $B_0 = I$ then the search directions $d_k$, $k = 1, 2, \ldots$, are
    $A$-conjugate (that is, $d_j^T A d_k = 0$ if $j \ne k$).

(6) Implement the BFGS method with Wolfe condition-based line search, and apply
    the method to minimizing $f(x, y) = x^2 + x y^2 - 10xy + 4x + y^4 - 2y$ with
    initial point $x_0 = (x_0, y_0) = (0, 0)$. Use default values for the Wolfe condition
    parameters $c_1$ and $c_2$.

(7) Use the Sherman–Morrison (2.1.16) or Sherman–Morrison–Woodbury
    (2.1.17) formula to derive the update (8.5.7) for $H_k = B_k^{-1}$ from the update
    (8.5.6) for $B_k$ for the BFGS method.

(8) Show that in the BFGS update (8.5.7) for $H_{k+1}$ in terms of $H_k$, if $H_k$ is
    positive definite and $1/\rho_k = y_k^T s_k > 0$, then $H_{k+1}$ is also positive definite. [Hint:
    First show that $H_{k+1}$ is positive semi-definite. Then see what conditions on $z$ hold if
    $z^T H_{k+1} z = 0$.]


8.6 Constrained Optimization 


Constrained optimization can be represented most abstractly in terms of a feasible
set, often denoted $\Omega \subseteq \mathbb{R}^n$:

(8.6.1)  $\min_x f(x)$ subject to $x \in \Omega$.

Solutions exist if $f$ is continuous and either $\Omega$ is a compact (closed and bounded)
subset of $\mathbb{R}^n$, or if $\Omega$ is closed and $f$ is coercive. Usually $\Omega$ is represented by equations
and inequalities:

(8.6.2)  $\Omega = \{x \in \mathbb{R}^n \mid g_i(x) = 0 \text{ for } i \in \mathcal{E}, \text{ and } g_i(x) \ge 0 \text{ for } i \in \mathcal{I}\}.$

If $\mathcal{I}$ is empty but $\mathcal{E}$ is not empty, then we say (8.6.1) is an equality constrained
optimization problem. If $\mathcal{I}$ is non-empty, we say (8.6.1) is an inequality constrained
optimization problem.

For a general constrained optimization problem, first-order conditions can be
given in terms of the tangent cone

(8.6.3)  $T_\Omega(x) = \left\{ \lim_{k \to \infty} \dfrac{x_k - x}{t_k} \;\middle|\; x_k \in \Omega,\ x_k \to x \text{ as } k \to \infty, \text{ and } t_k \downarrow 0 \text{ as } k \to \infty \right\}.$



Lemma 8.19 If $x = x^*$ minimizes $f(x)$ over $x \in \Omega$ and $f$ is differentiable at $x^*$,
then

(8.6.4)  $\nabla f(x^*)^T d \ge 0$  for all $d \in T_\Omega(x^*)$.


Proof Suppose $x = x^* \in \Omega$ minimizes $f(x)$ over $x \in \Omega$ and $f$ is differentiable.
Then for any $d \in T_\Omega(x^*)$, there are sequences $t_k \downarrow 0$ and $x_k \to x^*$ as $k \to \infty$ with $x_k \in \Omega$
where $d_k := (x_k - x^*)/t_k \to d$ as $k \to \infty$. Since $f(x^*) \le f(x_k) = f(x^* + t_k d_k)$,

$$0 \le \lim_{k \to \infty} \frac{f(x^* + t_k d_k) - f(x^*)}{t_k} = \nabla f(x^*)^T \lim_{k \to \infty} d_k = \nabla f(x^*)^T d.$$

This holds for any $d \in T_\Omega(x^*)$ showing (8.6.4), as we wanted.


Constraint qualifications relate the tangent cone $T_\Omega(x)$ to the linearizations of the
constraint functions:

$$C_\Omega(x) = \{d \in \mathbb{R}^n \mid \nabla g_i(x)^T d = 0 \text{ for all } i \in \mathcal{E},\
    \nabla g_i(x)^T d \ge 0 \text{ for all } i \in \mathcal{I} \text{ where } g_i(x) = 0\}.$$

For equality constrained optimization ($\mathcal{I} = \emptyset$), the LICQ (8.1.2) implies that
$T_\Omega(x) = C_\Omega(x)$ as noted in Section 8.1.3.


8.6.1 Equality Constrained Optimization


The theory of Section 8.1.3 for Lagrange multipliers and equality constrained optimization
(8.1.5) can be immediately turned into a numerical method. To solve

(8.6.5)  $0 = \nabla f(x) - \sum_{i \in \mathcal{E}} \lambda_i \nabla g_i(x),$
(8.6.6)  $0 = g_i(x), \quad i \in \mathcal{E},$

for $(x, \lambda)$ with $\lambda = [\lambda_i \mid i \in \mathcal{E}]$ we can apply, for example, Newton’s method. For
unconstrained optimization, we can then perform a line search to ensure that the step
improves the solution estimate. The issue in constrained optimization is that $f(x)$
alone is no longer suitable for measuring improvements. Constrained optimization
problems have two objectives: staying on the feasible set, and minimizing $f(x)$. It
may be necessary to increase $f(x)$ in order to return to the feasible set. Solving the
Newton equations for (8.6.5, 8.6.6) gives a direction $d$. Because of the curvature of
the feasible set $\Omega$ for general functions $g_i$, moving in the direction $d$ even if $x$ is
feasible may take the point $x + sd$ off the feasible set. This can be offset by having
a second order correction step to move back toward the feasible set. This second
order correction uses a least squares version of Newton’s method to solve $g(x) = 0$.
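
Algorithm 82 Basic Newton-based equality constrained optimization method
 1  function optnewtonequality($f$, $\nabla f$, $\operatorname{Hess} f$, $g$, $\nabla g$, $\operatorname{Hess} g$, $x$, $\lambda$, $\alpha$, $\epsilon$, $\eta$)
 2    while $\|\nabla f(x) - \nabla g(x)^T \lambda\| > \epsilon$ or $\|g(x)\| > \epsilon$
 3      $H \leftarrow \operatorname{Hess} f(x) - \sum_{i \in \mathcal{E}} \lambda_i \operatorname{Hess} g_i(x)$
 4      $\beta \leftarrow \max(\eta, -2\lambda_{\min}(H))$   // ensures $H + \beta I$ is positive definite
 5      solve $\begin{bmatrix} H + \beta I & -\nabla g(x)^T \\ \nabla g(x) & 0 \end{bmatrix}
              \begin{bmatrix} \delta x \\ \delta\lambda \end{bmatrix}
              = -\begin{bmatrix} \nabla f(x) - \nabla g(x)^T \lambda \\ g(x) \end{bmatrix}$
 6      $s \leftarrow 1$; accept $\leftarrow$ false
 7      while not accept
 8        $x^+ \leftarrow x + s\, \delta x$; $\lambda^+ \leftarrow \lambda + s\, \delta\lambda$
 9        factor $\nabla g(x^+)^T = Q R$   // reduced QR fact’n
10        $x^+ \leftarrow x^+ - Q R^{-T} g(x^+)$   // 2nd order correction
11        define $m_\alpha(z) = f(z) + \alpha \sum_{i \in \mathcal{E}} |g_i(z)|$   // merit function
12        if $m_\alpha(x^+) \le m_\alpha(x) + c_1 \nabla m_\alpha(x)^T (x^+ - x)$
13          accept $\leftarrow$ true
14        else
15          $s \leftarrow s/2$
16        end if
17      end while
18      $x \leftarrow x^+$; $\lambda \leftarrow \lambda^+$
19    end while
20    return $(x, \lambda)$
21  end function

Here $\nabla g(x)$ denotes the Jacobian matrix of $g$, whose rows are $\nabla g_i(x)^T$ for $i \in \mathcal{E}$.
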


Since this is an under-determined system for $|\mathcal{E}| < n$, we find the solution $\delta x$ for
$\nabla g_i(x)^T \delta x = -g_i(x)$, $i \in \mathcal{E}$, that minimizes $\|\delta x\|_2$, which can be done using the
QR factorization of $[\nabla g_i(x) \mid i \in \mathcal{E}]$.
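
The minimum-norm correction can be sketched with NumPy as follows. The function name `min_norm_correction` and the test data are assumptions for illustration. Here `jac` is taken to be the $|\mathcal{E}| \times n$ matrix whose rows are $\nabla g_i(x)^T$, assumed to have full row rank, so that its transpose $[\nabla g_i(x) \mid i \in \mathcal{E}] = QR$ and the minimum-norm solution is $\delta x = -Q R^{-T} g(x)$.

```python
import numpy as np

def min_norm_correction(jac, g_val):
    """Minimum 2-norm dx with jac @ dx = -g_val; jac is |E| x n with full row rank."""
    Q, R = np.linalg.qr(jac.T)               # reduced QR of the n x |E| matrix of gradients
    return -Q @ np.linalg.solve(R.T, g_val)  # dx = -Q R^{-T} g

# Illustrative data: two linearized constraints in three variables.
jac = np.array([[1.0, 2.0, 2.0],
                [0.0, 1.0, -1.0]])
g_val = np.array([3.0, 1.0])
dx = min_norm_correction(jac, g_val)
print(jac @ dx + g_val)                      # should be (numerically) zero
```

Because $\delta x$ lies in the span of the constraint gradients, it changes $x$ as little as possible while restoring the linearized constraints.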

For line search algorithms, we can use a merit function to determine the quality of
the result of the step. Often, merit functions of the form $x \mapsto f(x) + \alpha \sum_{i \in \mathcal{E}} |g_i(x)|$
are used where $\alpha > \max_{i \in \mathcal{E}} |\lambda_i|$. A basic method for solving equality constrained
optimization problems is shown in Algorithm 82.

If the second-order correction is skipped, then the Newton method may fail to
give rapid convergence, as was noted by N. Maratos in his PhD thesis [170].


8.6.2 Inequality Constrained Optimization 


Inequality constrained optimization is more complex, both in theory and practice. 
The theorem giving necessary conditions for inequality constrained optimization 
was only discovered in the middle of the twentieth century, while Lagrange used 
Lagrange multipliers in his Mécanique Analytique [151] (1788-1789). The neces- 
sary conditions for inequality constrained optimization are called Kuhn—Tucker or 
Karush—Kuhn-Tucker conditions. The first journal publication with these conditions 
was a paper by Kuhn and Tucker in [149] (1951), although the essence of these 
conditions was contained in an unpublished Master’s thesis of Karush [139] (1939). 
The work of Kuhn and Tucker was intended to build on the work of G. Dantzig and
others [69] on linear programming:

(8.6.7)  $\min_x c^T x$ subject to
(8.6.8)  $Ax \ge b,$

where “$a \ge b$” is understood to mean “$a_i \ge b_i$ for all $i$”.

It was Dantzig who created the simplex algorithm in 1946 [69], being the first
general-purpose and efficient algorithm for solving linear programs (8.6.7, 8.6.8).
The simplex method can be considered an example of an active set method as it
tracks which of the inequalities $(Ax)_i \ge b_i$ is actually an equality as it updates the
candidate optimizer $x$. Since then there has been a great deal of work on alternative
methods, most notably interior point methods that typically minimize a sequence of
penalized problems such as

$$c^T x - \alpha \sum_{i=1}^{m} \ln((Ax)_i - b_i)$$

where $\alpha > 0$ is a parameter that is reduced to zero in the limit. The first published
interior point method was due to Karmarkar [138] (1984). Another approach is the
ellipsoidal method of Khachiyan [120] (1979), which at each step $k$ minimizes $c^T x$
over $x$ lying inside an ellipsoid centered at $x_k$ that is guaranteed to be inside the
feasible set $\{x \mid Ax \ge b\}$. Khachiyan’s ellipsoidal method built on previous ideas
of N.Z. Shor but was the first guaranteed polynomial time algorithm for linear programming.
Karmarkar’s algorithm also guaranteed polynomial time, but was much
faster in practice than Khachiyan’s method and the first algorithm to have a better
time than the simplex method on average.


8.6.2.1. The Farkas Alternative 


An important result for constrained optimization is the Farkas alternative: 


Theorem 8.20 (Farkas alternative) Given real matrices $A$ and $B$ and a vector $b$,
then either (exclusive)

• there are vectors $u \ge 0$ and $v$ where $b = Au + Bv$, or
• there is a vector $y$ where $y^T A \ge 0$, $y^T B = 0$ and $y^T b < 0$.


Proof (Outline) The set $C := \{Au + Bv \mid u \ge 0\}$ is a closed convex set. The
fact that it is convex follows from the fact that it is the image of the convex set
$\{(u, v) \mid u \ge 0\}$ under a linear transformation. The fact that $C$ is closed is a more
technical condition which is shown in, for example, Nocedal and Wright [190,
Lemma 12.15, p. 350] using an argument of R. Byrd.

The separating hyperplane theorem (Theorem 8.13) then implies that if $b \notin C$
then there is a vector $n$ where $n^T b < 0$ and $n^T w \ge 0$ for every $w \in C$. So either
$b = Au + Bv$ for some $u \ge 0$ and $v$ (that is, $b \in C$) or there is $n$ where $n^T b < 0$
and $n^T (Au + Bv) \ge 0$ for every $u \ge 0$ and $v$ (for $b \notin C$).

In the case where $b \notin C$, we first set $v = 0$ so that $n^T A u \ge 0$ for all $u \ge 0$, which
implies that $n^T A \ge 0$. Now consider $u = 0$ so that $n^T B v \ge 0$ for all $v$. Replacing $v$
with $-v$ (which is allowed since $v$ can be any vector of the right dimension), we get
$n^T B v \le 0$ for all $v$. Thus $n^T B v = 0$ for all $v$. Since this is true for all vectors $v$ we
get $n^T B = 0$. Setting $y = n$ gives $y^T A \ge 0$ and $y^T B = 0$ while $y^T b < 0$.

Thus if $b \in C$ we obtain the first alternative; if $b \notin C$ we obtain the second alternative.
The two conditions are exclusive because $b = Au + Bv$ with $u \ge 0$, $y^T A \ge 0$
and $y^T B = 0$ give $0 > y^T b = y^T A u + y^T B v \ge 0$, which is impossible. Thus we
have established the Farkas alternative.


8.6.2.2 Proving the Karush-Kuhn-Tucker Conditions 


To prove the Karush-Kuhn-Tucker conditions we need a constraint qualification to
ensure that

(8.6.9) $$T_\Omega(x) = \{\, d \mid \nabla g_i(x)^T d = 0 \text{ for all } i \in \mathcal{E},\ \nabla g_i(x)^T d \ge 0 \text{ for all } i \in \mathcal{I} \text{ where } g_i(x) = 0 \,\} = C_\Omega(x).$$

A constraint $g_i(x) \ge 0$ is called active at $x$ if $g_i(x) = 0$ and inactive at $x$ if $g_i(x) > 0$.
Inactive inequality constraints at $x$ do not affect the shape of the feasible set $\Omega$ near
$x$. We designate the set of active constraints by

(8.6.10) $$\mathcal{A}(x) = \{\, i \mid i \in \mathcal{E} \cup \mathcal{I} \text{ and } g_i(x) = 0 \,\}.$$


The equivalence (8.6.9) holds under a number of constraint qualifications, the most
used of which is the Linear Independence Constraint Qualification (LICQ) for
inequality constrained optimization:

(8.6.11) $$\{\, \nabla g_i(x) \mid i \in \mathcal{A}(x) \,\} \text{ is a linearly independent set.}$$

Weaker constraint qualifications that guarantee (8.6.9) include the
Mangasarian-Fromowitz constraint qualification (MFCQ):

(8.6.12) $$\{\, \nabla g_i(x) \mid i \in \mathcal{E} \,\} \text{ is a linearly independent set, and there is } d \text{ where } \nabla g_i(x)^T d = 0 \text{ for all } i \in \mathcal{E}, \text{ and } \nabla g_i(x)^T d > 0 \text{ for all } i \in \mathcal{I} \cap \mathcal{A}(x).$$

With a suitable constraint qualification, we can prove the existence of Lagrange
multipliers satisfying the Karush-Kuhn-Tucker conditions.
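Theorem 8.21 Suppose (8.6.9) holds for all $x$ in the feasible set
$$\Omega = \{\, x \mid g_i(x) = 0 \text{ for } i \in \mathcal{E},\ g_i(x) \ge 0 \text{ for } i \in \mathcal{I} \,\}.$$

Then if $x^*$ minimizes $f(x)$ subject to $x \in \Omega$, there are Lagrange multipliers $\lambda_i$ for
$i \in \mathcal{E} \cup \mathcal{I}$ where

(8.6.13) $$0 = \nabla f(x^*) - \sum_{i \in \mathcal{E} \cup \mathcal{I}} \lambda_i \nabla g_i(x^*),$$
(8.6.14) $$0 \le \lambda_i,\ g_i(x^*) \quad \text{and} \quad \lambda_i\, g_i(x^*) = 0 \quad \text{for all } i \in \mathcal{I},$$
(8.6.15) $$x^* \in \Omega.$$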


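Proof Suppose $x^*$ minimizes $f(x)$ subject to $x \in \Omega$. Then by Lemma 8.19, there is
no $d \in T_\Omega(x^*)$ where $\nabla f(x^*)^T d < 0$. By (8.6.9), there is no $d$ where $\nabla f(x^*)^T d < 0$
and

$$0 = \nabla g_i(x^*)^T d \quad \text{for all } i \in \mathcal{E},$$
$$0 \le \nabla g_i(x^*)^T d \quad \text{for all } i \in \mathcal{I} \cap \mathcal{A}(x^*).$$

If we set $A = [\nabla g_i(x^*) \mid i \in \mathcal{I} \cap \mathcal{A}(x^*)]$ and $B = [\nabla g_i(x^*) \mid i \in \mathcal{E}]$, then from the
Farkas alternative (Theorem 8.20) with $b = \nabla f(x^*)$ we must take the first alternative,
since the second alternative would supply exactly such a vector $d = y$. That is, there
must be $\mu \ge 0$ and $\nu$ where $\nabla f(x^*) = A\mu + B\nu$. Let $\lambda_i = \mu_i$ for $i \in \mathcal{I} \cap \mathcal{A}(x^*)$,
and $\lambda_i = \nu_i$ if $i \in \mathcal{E}$. For inactive inequality constraints ($i \in \mathcal{I} \setminus \mathcal{A}(x^*)$), we set $\lambda_i = 0$.
Then we see that (8.6.13) holds:

$$\nabla f(x^*) = \sum_{i \in \mathcal{E} \cup \mathcal{I}} \lambda_i \nabla g_i(x^*).$$

Also $\lambda_i \ge 0$ for all $i \in \mathcal{I}$. Since $x^*$ is feasible, $g_i(x^*) \ge 0$ for all $i \in \mathcal{I}$. Since
$g_i(x^*) > 0$ implies $i \notin \mathcal{A}(x^*)$, in that case we have $\lambda_i = 0$, so that $\lambda_i\, g_i(x^*) = 0$
for all $i \in \mathcal{I}$. Thus (8.6.14)
holds. Finally, (8.6.15) holds because $x^*$ must be feasible to be a constrained minimizer.
Thus all of the Karush-Kuhn-Tucker conditions (8.6.13-8.6.15) hold, as we
wanted.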
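
As an illustration of how the conditions (8.6.13-8.6.15) can be checked numerically, the following Python sketch uses a small made-up problem that is not from the text: minimize $(x-2)^2 + (y+1)^2$ subject to $1 - x - y \ge 0$, $x \ge 0$, $y \ge 0$. The helper names `kkt_multipliers` and `kkt_holds` are hypothetical, and the sketch assumes exactly two active constraints with linearly independent gradients, so that the multipliers come from a 2 x 2 linear solve.

```python
# Hypothetical example problem (not from the text):
#   minimize f(x, y) = (x - 2)^2 + (y + 1)^2
#   subject to g1 = 1 - x - y >= 0, g2 = x >= 0, g3 = y >= 0.

def grad_f(x, y):
    return (2.0 * (x - 2.0), 2.0 * (y + 1.0))

# Each constraint is a pair (g, gradient of g).
constraints = [
    (lambda x, y: 1.0 - x - y, lambda x, y: (-1.0, -1.0)),
    (lambda x, y: x,           lambda x, y: (1.0, 0.0)),
    (lambda x, y: y,           lambda x, y: (0.0, 1.0)),
]

def kkt_multipliers(x, y, tol=1e-12):
    """Solve grad f = sum_i lambda_i grad g_i over the active set.

    Assumes exactly two active constraints with independent gradients (LICQ);
    inactive constraints get lambda_i = 0.
    """
    active = [i for i, (g, _) in enumerate(constraints) if abs(g(x, y)) <= tol]
    assert len(active) == 2, "sketch only handles two active constraints"
    a11, a21 = constraints[active[0]][1](x, y)   # first gradient = column 1
    a12, a22 = constraints[active[1]][1](x, y)   # second gradient = column 2
    b1, b2 = grad_f(x, y)
    det = a11 * a22 - a12 * a21
    lam = [0.0] * len(constraints)
    lam[active[0]] = (b1 * a22 - a12 * b2) / det  # Cramer's rule
    lam[active[1]] = (a11 * b2 - b1 * a21) / det
    return lam

def kkt_holds(x, y, lam, tol=1e-9):
    """Check stationarity, feasibility, sign and complementarity conditions."""
    rx, ry = grad_f(x, y)
    for l, (_, dg) in zip(lam, constraints):
        dx, dy = dg(x, y)
        rx -= l * dx
        ry -= l * dy
    stationary = abs(rx) <= tol and abs(ry) <= tol
    feasible = all(g(x, y) >= -tol for g, _ in constraints)
    signs = all(l >= -tol for l in lam)
    complementary = all(abs(l * g(x, y)) <= tol
                        for l, (g, _) in zip(lam, constraints))
    return stationary and feasible and signs and complementary

lam_star = kkt_multipliers(1.0, 0.0)      # candidate minimizer (1, 0)
print(lam_star, kkt_holds(1.0, 0.0, lam_star))
lam_bad = kkt_multipliers(0.0, 0.0)       # a feasible vertex that is not optimal
print(lam_bad, kkt_holds(0.0, 0.0, lam_bad))
```

At the vertex $(0, 0)$ the stationarity equation can still be solved, but one multiplier is negative, so the sign condition in (8.6.14) fails there.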


8.6.2.3 Algorithms for Inequality Constrained Optimization 


There is a wide range of algorithms for inequality constrained optimization. 

Some algorithms add "slack variables" to turn inequality constrained optimization
problems into equivalent equality constrained problems: replace "$g_i(x) \ge 0$"
with "$g_i(x) - s_i^2 = 0$" where $s_i$ is a new "slack" variable. The danger here is that if
the MFCQ constraint qualification (8.6.12) holds, then the new equality constraints
are likely to violate the LICQ for equality constrained optimization problems.

Some algorithms are set up like the interior point methods for linear programming:
each inequality constraint $g_i(x) \ge 0$ is turned into a penalty term, so that the objective
function becomes

$$f(x) - \alpha \sum_{i \in \mathcal{I}} \ln g_i(x),$$

with $\alpha$ being reduced toward zero in a stepwise fashion.
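
The effect of reducing the barrier parameter can be seen on a one-variable toy problem. The sketch below is an assumed example, not from the text: minimize $x$ subject to $x - 1 \ge 0$, so the barrier objective is $x - \alpha \ln(x - 1)$ with exact minimizer $1 + \alpha$. The function name `barrier_minimizer` and the bracketing interval are illustrative choices, and bisection on the derivative stands in for whatever inner solver a real method would use.

```python
def barrier_minimizer(alpha, lo=1.0, hi=10.0, iters=100):
    """Minimize phi(x) = x - alpha*ln(x - 1) over 1 < x < hi.

    phi is convex on (1, inf), so its derivative is increasing and the
    minimizer is the unique root of phi'(x) = 1 - alpha/(x - 1).
    """
    def dphi(x):
        return 1.0 - alpha / (x - 1.0)

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dphi(mid) < 0.0:
            lo = mid          # still descending: minimizer lies to the right
        else:
            hi = mid          # ascending: minimizer lies to the left
    return 0.5 * (lo + hi)

# Reduce alpha stepwise; the barrier minimizers approach the
# constrained minimizer x = 1 from the interior of the feasible set.
alphas = [1.0, 0.1, 0.01, 0.001]
path = [barrier_minimizer(a) for a in alphas]
for a, x in zip(alphas, path):
    print(a, x)
```

Each barrier minimizer is strictly feasible, which is why methods of this type are called interior point methods.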
Some algorithms use linearizations of the constraints along with second order
Taylor polynomials for the objective function:

(8.6.16) $$\min_p\; f(x_k) + \nabla f(x_k)^T p + \tfrac{1}{2}\, p^T \operatorname{Hess}_x L(x_k, \lambda_k)\, p$$
subject to
(8.6.17) $$g_i(x_k) + \nabla g_i(x_k)^T p = 0 \quad \text{for all } i \in \mathcal{E},$$
(8.6.18) $$g_i(x_k) + \nabla g_i(x_k)^T p \ge 0 \quad \text{for all } i \in \mathcal{I}.$$

This is called the Successive Quadratic Programming method or SQP method.
Difficulties can arise here if the linearized problem is infeasible. Each subproblem
(8.6.16-8.6.18) is a quadratic program:

(8.6.19) $$\min_z\; g^T z + \tfrac{1}{2}\, z^T B z$$
(8.6.20) $$\text{subject to} \quad A_{\mathcal{E}}\, z = b_{\mathcal{E}},$$
(8.6.21) $$A_{\mathcal{I}}\, z \ge b_{\mathcal{I}}.$$


Convex quadratic programs where $B$ is positive semi-definite can be solved in a finite
process similar to the simplex method [95, 185]. It is possible that the constraints
(8.6.20, 8.6.21) may be inconsistent, but the algorithms for quadratic programs can
detect this. This can also be avoided by using non-smooth penalties instead of inequality
constraints: instead of a constraint $a^T z \ge b$, we add a penalty $M \max(b - a^T z, 0)$
with $M$ positive and large. Although this function is clearly nonlinear in $z$, it can be
represented by adding a slack variable $s$ satisfying the linear inequalities $s \ge 0$ and
$s \ge b - a^T z$; the penalty term added to the objective is $M s$. This gives an alternative
quadratic program

$$\min_{z,\,s}\; g^T z + \tfrac{1}{2}\, z^T B z + M \sum_{i \in \mathcal{E} \cup \mathcal{I}} s_i$$
subject to
$$s_i \ge +(b_i - a_i^T z) \quad \text{for all } i \in \mathcal{E},$$
$$s_i \ge -(b_i - a_i^T z) \quad \text{for all } i \in \mathcal{E},$$
$$s_i \ge 0 \ \text{ and } \ s_i \ge b_i - a_i^T z \quad \text{for all } i \in \mathcal{I},$$

which always has a solution. The solution of this modified quadratic program is also
equal to the solution of (8.6.19-8.6.21) provided the original quadratic program has
a solution and $M$ is greater than the sum of the absolute values of the Lagrange
multipliers.

SQP methods typically use a "merit function" to represent the quality of the
solution for purposes of line searches or trust regions, such as

$$m_\alpha(x) = f(x) + \alpha \Bigl[\, \sum_{i \in \mathcal{E}} |g_i(x)| + \sum_{i \in \mathcal{I}} \max(-g_i(x), 0) \Bigr]$$

similar to the function $m_\alpha$ in Algorithm 82. SQP methods also need to take into
account the curvature of the constraints by incorporating a correction step similar
to that for equality constrained problems, to restore the constraints. This constraint
correction step should be taken before the merit function is evaluated, to determine
the step length.
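
A merit function of this kind is straightforward to evaluate. The Python sketch below uses hypothetical names (`merit`, `grid_argmin`) and an assumed toy problem that is not from the text: minimize $x$ subject to $x - 1 \ge 0$, whose Lagrange multiplier is $\lambda = 1$. A crude grid search then shows the role of the penalty parameter: with $\alpha = 2 > \lambda$ the merit function is minimized at the constrained minimizer $x = 1$, while with $\alpha = 0.5 < \lambda$ it is not.

```python
def merit(f, eq_constraints, ineq_constraints, alpha, x):
    """m_alpha(x) = f(x) + alpha * (sum |g_i(x)| over equality constraints
                                   + sum max(-g_i(x), 0) over inequalities)."""
    violation = sum(abs(g(x)) for g in eq_constraints)
    violation += sum(max(-g(x), 0.0) for g in ineq_constraints)
    return f(x) + alpha * violation

# Hypothetical 1-D test problem: minimize x subject to x - 1 >= 0.
f = lambda x: x
ineq = [lambda x: x - 1.0]

def grid_argmin(alpha):
    """Minimize the merit function over a uniform grid on [-3, 3]."""
    grid = [-3.0 + 0.01 * k for k in range(601)]
    return min(grid, key=lambda x: merit(f, [], ineq, alpha, x))

x_large = grid_argmin(2.0)   # alpha larger than the multiplier
x_small = grid_argmin(0.5)   # alpha smaller than the multiplier
print(x_large, x_small)
```

With the smaller value of $\alpha$ the grid search runs to the left end of the grid, because the penalty is too weak to offset the decrease in $f$.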


Exercises. 


(1) Show that (8.6.9) holds if all the constraint functions are affine: $g_i(x) = a_i^T x + b_i$.
This is the affine constraint qualification. Use this to show that Lagrange multipliers
exist for linear programs: $\min_x c^T x$ subject to $Ax \ge b$ (componentwise).

(2) The paper of Kuhn and Tucker [149] gives an example where constraint qualifications
fail, and no Lagrange multiplier exists: $\min x$ over all $(x, y) \in \mathbb{R}^2$ subject
to $y \ge 0$ and $y \le x^3$. Show that the minimizer is $(x, y) = (0, 0)$. Then show that
no Lagrange multiplier exists at this point.

(3) As in the previous Exercise, $f(x) = x^T A x$ with $A$ symmetric. Now consider
the problem $\min_x f(x)$ subject to the condition that $x^T x \le 1$. With the same
Lagrangian, what property does the Lagrange multiplier have? Express your
property in terms of the eigenvalues of $A$.

(4) Prove that the Mangasarian-Fromowitz constraint qualification (8.6.12) implies
(8.6.9). [Hint: First show that $\nabla g_i(x)^T p > 0$ for all $i$ where $g_i(x) = 0$ implies
$p \in T_\Omega(x)$. Now suppose that $\nabla g_i(x)^T d \ge 0$ and $\nabla g_i(x)^T \widetilde{d} > 0$ for all $i$ where
$g_i(x) = 0$. Show, therefore, that $\nabla g_i(x)^T (d + \epsilon \widetilde{d}) > 0$ if $g_i(x) = 0$ for any $\epsilon > 0$,
so $d + \epsilon \widetilde{d} \in T_\Omega(x)$. Finish by using the fact that $T_\Omega(x)$ is closed to conclude
that $d \in T_\Omega(x)$.]

(5) Suppose that all the constraint functions are concave; that is, $-g_i$ is convex, for
all $i$. Slater's constraint qualification is that there is $\widetilde{x}$ such that $g_i(\widetilde{x}) > 0$ for
all $i$. Show that under these conditions, (8.6.9) holds. [Hint: Set $\widetilde{d} = \widetilde{x} - x$.
Now $0 < g_i(\widetilde{x}) \le g_i(x) + \nabla g_i(x)^T (\widetilde{x} - x)$ since $g_i$ is concave. Conclude that
$\nabla g_i(x)^T \widetilde{d} > 0$ if $g_i(x) = 0$. Now use the fact that (8.6.12) implies (8.6.9) (see
the previous Exercise).]

(6) Find the global minimizer of $f(x, y) = x^4 + x^2 y + y^4 - x - y$ subject to $x^2 + y^2 = 1$.
Note that there are multiple critical points. Symbolic solvers can find all
solutions of the Lagrange multiplier equations.
(7) Inequality constraints can be turned into equality constraints with the help of a
"slack" variable. Consider the problem $\min_x f(x)$ subject to $g(x) \ge 0$. This is
equivalent to $\min_{x,s} f(x)$ subject to $g(x) - s^2 = 0$. Use the first- and second-order
necessary conditions for the equality constrained version to obtain the KKT
conditions (8.6.13-8.6.15).


(8) Solve the following constrained optimization problem:

$$\min_{x,y}\; e^{-x} - xy + y^2 \quad \text{subject to}$$
$$x, y \ge 0,$$
$$x + y \le 2.$$

Compute the Lagrange multipliers for your solution. Check that the KKT conditions
hold at your solution. [Note: You may need to solve a nonlinear equation in
one variable numerically.] Check the second-order conditions for your solution
as well.


(9) Suppose that $x^*$ is a KKT point for the minimization problem

$$\min_x f(x) \quad \text{subject to}$$
$$g_i(x) \ge 0 \quad \text{for } i = 1, 2, \ldots, m.$$

Show that if $f$ and $-g_i$ are convex functions for $i = 1, 2, \ldots, m$, then $x^*$ is
a global minimum for the constrained optimization problem. [Hint: Show that
$f(x) - \sum_{i=1}^{m} \lambda_i\, g_i(x)$ is a convex function of $x$ where $\lambda_i$ are the Lagrange
multipliers at $x^*$.]

(10) In a sense, constrained optimization problems $\min_x f(x)$ subject to $g(x) = 0 \in \mathbb{R}^m$
have two objectives: to minimize $f(x)$, and to satisfy the constraints
$g(x) = 0$. One way of combining these two is in a merit function $m_\alpha(x) = f(x) +
\alpha \sum_{i=1}^{m} |g_i(x)|$. Show that if $x^*$ is a strict local constrained minimizer, then $x^*$
is a local (unconstrained) minimizer of $m_\alpha$ provided $\alpha > \max_{i=1,\ldots,m} |\lambda_i|$ where
$\lambda$ is the Lagrange multiplier at $x^*$.

(11) N. Maratos [170] showed that if a merit function like $m_\alpha$ of the previous Exercise
is used, then a second-order correction or something similar should also be used.
Consider the problem $\min_{x,y} x$ subject to $x^2 + y^2 = 1$, and $\alpha = 2$. Show that
if $\mathbf{x}_k = (x_k, y_k) = (\cos\theta_k, \sin\theta_k)$ with $\theta_k \approx \pi$, then the Newton step $\mathbf{d}_k$ for the
Lagrange multiplier and constraint equations will not reduce the merit function:
$m_\alpha(\mathbf{x}_k + \mathbf{d}_k) > m_\alpha(\mathbf{x}_k)$. From this, explain the justification for the second-order
correction step.

(12) Consider the minimax problem: $\min_x f(x) + \max\{g_1(x), g_2(x), \ldots, g_m(x)\}$
with all functions $f$ and $g_i$ smooth. Show that this can be represented as a
smoothly constrained optimization problem:

$$\min_{x,s}\; f(x) + s \quad \text{subject to}$$
$$s \ge g_i(x), \quad i = 1, 2, \ldots, m.$$

Show also that this formulation satisfies the Mangasarian-Fromowitz constraint
qualification (8.6.12).


Project 

An optimal control problem is a differential equation involving a "control" variable
that can be changed at will over time to minimize a certain objective function, subject
to constraints on the control function. See, for example, [35, 142, 162] for more
information about optimal control theory in general. A standard form is:

$$\min_{u(\cdot),\,x(\cdot)}\; g(x(T)) \quad \text{subject to}$$
$$\frac{dx}{dt}(t) = f(t, x(t), u(t)), \qquad x(t_0) = x_0,$$
$$u(t) \in U \quad \text{for all } t.$$

The set $U$ should be a convex set. The differential equation can be discretized by, for
example, the Euler method (6.1.8): $x_k \approx x(t_k)$ with $t_k = t_0 + kh$ where

(8.6.22) $$x_{k+1} = x_k + h\, f(t_k, x_k, u_k), \qquad k = 0, 1, 2, \ldots, N-1.$$


Each of these constraints has a Lagrange multiplier $\lambda_k$ and we define the Lagrangian

$$L(x, u, \lambda) = g(x_N) - \sum_{k=0}^{N-1} \lambda_k^T \bigl[\, x_{k+1} - x_k - h\, f(t_k, x_k, u_k) \,\bigr].$$

If we set the linearization of $L(x + \delta x, u, \lambda) - L(x, u, \lambda)$ to zero, noting that
$\delta x_0 = 0$ as $x_0$ is fixed, we obtain the equations for the $\lambda_k$:

(8.6.23) $$\nabla g(x_N) = \lambda_{N-1},$$
(8.6.24) $$\lambda_{k-1} = \lambda_k + h\, \nabla_x f(t_k, x_k, u_k)^T \lambda_k, \qquad k = 1, 2, \ldots, N-1.$$


The gradient of the objective function $g(x_N)$ with respect to the $u_k$, constrained by
(8.6.22), is given by the linearized change

$$\delta[g(x_N)] = h \sum_{k=0}^{N-1} \lambda_k^T\, \nabla_u f(t_k, x_k, u_k)\, \delta u_k.$$

To perform the optimization, we use a gradient projection algorithm: take a step in
the negative gradient direction, then project that back to the feasible set, which in
this case is $u_k \in U$ for $k = 0, 1, 2, \ldots, N-1$:

(8.6.25) $$u_k^+ \leftarrow \operatorname{proj}_U\bigl(u_k - s\,h\, \nabla_u f(t_k, x_k, u_k)^T \lambda_k\bigr), \qquad k = 0, 1, 2, \ldots, N-1,$$

where $\operatorname{proj}_U(w)$ returns the nearest point in $U$ to $w$. The step length $s$ can either
be constant (chosen after some experimentation) or chosen according to a sufficient
decrease criterion based on the objective function value after projection using $u_k^+$.
The inner parts of the algorithm are: (1) simulate the dynamics for $x_k$ given $u_k$ by
(8.6.22); (2) compute the $\lambda_k$ using (8.6.23, 8.6.24) going backwards in $k$; (3) perform
the gradient projection step (8.6.25); and (4) accept or reject $u_k^+$ (and update $s$)
according to the sufficient decrease criterion (optional).

Use this to approximately solve the optimal harvesting problem: $\max m(T)$ where

$$\frac{dm}{dt} = \delta\, m + u(t)\, f, \qquad m(0) = 0,$$
$$\frac{df}{dt} = a f - b f^2 - u(t)\, f, \qquad f(0) = a/b,$$

subject to $u(t) \in [0, 0.20]$. Use $T = 100$, $a = 0.05$, $b = 0.02$, $\delta = 0.10$. In this,
$f(t)$ is the quantity of fish in the sea, $m(t)$ is the amount of money in fishermen's
bank accounts, $\delta$ is the interest rate at the bank, $a$ is the rate of reproduction of fish
(at low populations), and $u(t)$ represents the fishing effort. Re-run with $\delta = 0.05$ and
$a = 0.10$.


Appendix A 
What You Need from Analysis 


A.1 Banach and Hilbert Spaces 


A.1.1 Normed Spaces and Completeness


A vector space is a collection $V$ of vectors $v$ where there is vector addition $v + w$
and scalar multiplication $sv$ for $s$ a real number ($s \in \mathbb{R}$ or possibly $\mathbb{C}$) satisfying the
usual commutative, associative, and distributive properties:

$$u + (v + w) = (u + v) + w,$$
$$u + v = v + u,$$
$$r(sv) = (rs)v,$$
$$(r + s)v = rv + sv,$$
$$r(u + v) = ru + rv,$$
$$v + 0 = v = 0 + v,$$
$$0v = 0, \qquad 1v = v.$$

The simplest examples of interest to us are $\mathbb{R}^n$, the set of $n$-dimensional vectors,
represented by columns of $n$ real numbers:

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.$$

Addition is understood componentwise, scalar multiplication is applied to each entry,
and the zero vector is the vector where each entry is zero. Functions $f\colon D \to \mathbb{R}$ also
form a vector space with addition defined by $(f + g)(x) = f(x) + g(x)$ and scalar
multiplication by $(s \cdot f)(x) = s \cdot f(x)$. Matrices with $m$ rows and $n$ columns also
form a vector space with entrywise addition and scalar multiplication.

A normed vector space is a vector space $V$ with a norm $\|\cdot\|$ with the usual properties
of norms:

• $\|v\| \ge 0$ and $\|v\| = 0$ implies $v = 0$,
• $\|sv\| = |s|\, \|v\|$, and
• $\|v + w\| \le \|v\| + \|w\|$.

A sequence of vectors $v_i$, $i = 1, 2, 3, \ldots$, is a Cauchy sequence if for every $\epsilon > 0$
there is an $N$ where $k, \ell \ge N$ implies $\|v_k - v_\ell\| < \epsilon$. We say $v_i$ converges to $v$ as
$i \to \infty$ if for every $\epsilon > 0$ there is an $N$ where $k \ge N$ implies $\|v_k - v\| < \epsilon$, in which
case we say $v$ is the limit of the $v_i$: $\lim_{i \to \infty} v_i = v$. We also denote convergence to
a limit by $v_i \to v$ as $i \to \infty$.

The vector space $V$ is a complete normed space or Banach space if every Cauchy
sequence converges. Not all normed vector spaces are complete, such as the polynomials
over $[0, 1]$ with the norm $\|f\|_\infty = \max_{0 \le x \le 1} |f(x)|$. The sequence $f_k(x) =
\sum_{j=0}^{k} x^j/j!$ is a Cauchy sequence ($\|f_k - f_\ell\|_\infty \le \sum_{j=\min(k,\ell)+1}^{\max(k,\ell)} 1/j! \le 2/\min(k, \ell)!$)
but does not converge to a polynomial; instead $f_k \to f$ where $f(x) = e^x$. Every
normed vector space $V$ has a completion $\overline{V}$ which is a complete vector space, and
consists of all possible limits of Cauchy sequences in $V$. If $V$ is already complete,
then $\overline{V} = V$. Every finite dimensional normed space is already complete. Any two
norms of the same finite dimensional vector space are equivalent in the sense of
(1.5.2).
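
The Cauchy estimate for the partial sums of the exponential series can be checked numerically. The following Python sketch approximates the maximum norm on $[0, 1]$ by sampling on a uniform grid (an approximation, although here the maximum is attained at the grid point $x = 1$); the function names `partial_sum` and `sup_norm_diff` are illustrative only.

```python
import math

def partial_sum(k, x):
    """f_k(x) = sum_{j=0}^{k} x^j / j!, a polynomial of degree k."""
    return sum(x**j / math.factorial(j) for j in range(k + 1))

def sup_norm_diff(k, l, n_grid=201):
    """Approximate max over [0, 1] of |f_k(x) - f_l(x)| on a uniform grid."""
    xs = [i / (n_grid - 1) for i in range(n_grid)]
    return max(abs(partial_sum(k, x) - partial_sum(l, x)) for x in xs)

# Compare the sampled distance with the bound 2 / min(k, l)!.
pairs = [(2, 5), (4, 9), (8, 12)]
for k, l in pairs:
    print(k, l, sup_norm_diff(k, l), 2.0 / math.factorial(min(k, l)))

# The limit is exp(x), which is not a polynomial.
gap_to_exp = max(abs(partial_sum(12, i / 200) - math.exp(i / 200))
                 for i in range(201))
print(gap_to_exp)
```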

Note that $C(D)$, the space of continuous functions $D \to \mathbb{R}$ where $D$ is a closed and
bounded set in $\mathbb{R}^d$, is a complete normed space with norm $\|f\|_\infty = \max_{x \in D} |f(x)|$.

Most Banach spaces are best defined through completions. The spaces $L^p(D)$ for
$1 \le p < \infty$ and $D$ closed and bounded with positive volume in $\mathbb{R}^d$, are defined as the
completion of $C(D)$ with respect to the norm

$$\|f\|_{L^p(D)} = \left[ \int_D |f(x)|^p\, dx \right]^{1/p}.$$

The space $L^\infty(D)$ is defined as the set of integrable functions on $D$ for which
$$\|f\|_{L^\infty(D)} = \inf \{\, s \mid \operatorname{vol}_d(\{x \in D \mid |f(x)| > s\}) = 0 \,\}$$
is finite.


The $W^{m,p}(D)$ norm for $1 \le p < \infty$ is given by

$$\|f\|_{W^{m,p}(D)} = \left[ \sum_{\alpha : |\alpha| \le m} \int_D \left| \frac{\partial^{|\alpha|} f}{\partial x^\alpha}(x) \right|^p dx \right]^{1/p}$$

for $f$ having continuous order $m$ derivatives. The space $W^{m,p}(D)$ is
the completion of all functions with continuous order $m$ derivatives under this norm.
For $p = \infty$, we have

(A.1.1) $$\|f\|_{W^{m,\infty}(D)} = \max_{\alpha : |\alpha| \le m}\; \max_{x \in D} \left| \frac{\partial^{|\alpha|} f}{\partial x^\alpha}(x) \right|$$

for functions with continuous order $m$ derivatives. The space $W^{m,\infty}(D)$ is the set of
all functions $f$ where for every multi-index $\alpha$ with $|\alpha| \le m$, $\partial^{|\alpha|} f/\partial x^\alpha \in L^\infty(D)$.

Functions $F\colon V \to W$ between normed spaces are continuous if $v_i \to v$ implies
$F(v_i) \to F(v)$ as $i \to \infty$. If $F$ is also linear ($F(ru + sv) = r F(u) + s F(v)$ for all
$u, v \in V$ and $r, s \in \mathbb{R}$) then $F$ is continuous if and only if $\sup \{\, \|F(v)\|_W \mid \|v\|_V \le 1 \,\}$
is finite. In fact, there is the norm for linear functions $V \to W$ given by

(A.1.2) $$\|F\|_{V \to W} = \sup \{\, \|F(v)\|_W \mid \|v\|_V \le 1 \,\}.$$


A.1.2 Inner Products


An inner product space $V$ is a vector space with a real inner product $(\cdot, \cdot)\colon V \times V \to
\mathbb{R}$ which for real scalars has the properties:

• $(v, v) \ge 0$ for all $v$ and $(v, v) = 0$ implies $v = 0$,
• $(v, w) = (w, v)$ for all $v$ and $w$,
• $(u, rv + sw) = r(u, v) + s(u, w)$ for all $u$, $v$ and $w$, and scalars $r, s \in \mathbb{R}$.

Complex inner product spaces have an inner product with the following properties:

• $(v, v) \ge 0$ for all $v$ and $(v, v) = 0$ implies $v = 0$,
• $(v, w) = \overline{(w, v)}$ for all $v$ and $w$,
• $(u, rv + sw) = r(u, v) + s(u, w)$ for all $u$, $v$ and $w$, and scalars $r, s \in \mathbb{C}$,

where $\overline{z}$ is the complex conjugate of $z \in \mathbb{C}$.

Standard examples include $\mathbb{R}^n$ with the inner product $(x, y) = x^T y = \sum_{j=1}^{n} x_j y_j$,
and continuous functions $D \to \mathbb{R}$ where $D$ is a region in $\mathbb{R}^d$ with positive volume.
Then we can define $(f, g) = \int_D f(x)\, g(x)\, dx$ as an inner product on continuous
functions on $D$. For complex vectors we define $(x, y) = \overline{x}^T y = \sum_{j=1}^{n} \overline{x_j}\, y_j$; for
complex functions we usually use $(f, g) = \int_D \overline{f(x)}\, g(x)\, dx$.

Inner products define a norm: $\|v\| = \sqrt{(v, v)}$. This is the inner product norm. If $D$
is a closed and bounded subset of $\mathbb{R}^d$, then $C(D)$, the space of continuous functions
$D \to \mathbb{R}$, is an inner product space with $(f, g) = \int_D f(x)\, g(x)\, dx$. However, this
space is not complete with respect to the inner product norm $\|f\|_2 = \sqrt{(f, f)} =
\left[ \int_D f(x)^2\, dx \right]^{1/2}$. Instead the completion of this space with respect to this norm is
denoted $L^2(D)$, the set of square-integrable functions $f\colon D \to \mathbb{R}$. Complete inner
product spaces are called Hilbert spaces.
The Cauchy-Schwarz inequality applies to all inner product spaces:

(A.1.3) $$|(a, b)| \le \|a\|\, \|b\|$$

where the norm $\|a\| = \sqrt{(a, a)}$. In the case where $a, b \in \mathbb{R}^n$ and $(a, b) = a^T b$ we
have

(A.1.4) $$\left| \sum_{i=1}^{n} a_i b_i \right| \le \left[ \sum_{i=1}^{n} a_i^2 \right]^{1/2} \left[ \sum_{i=1}^{n} b_i^2 \right]^{1/2};$$

for $f, g \in L^2(D)$ there is the integral version:

(A.1.5) $$\left| \int_D f(x)\, g(x)\, dx \right| \le \left[ \int_D f(x)^2\, dx \right]^{1/2} \left[ \int_D g(x)^2\, dx \right]^{1/2}.$$


A.1.3 Dual Spaces and Weak Convergence


A linear functional $\ell$ is a continuous linear function $\ell\colon V \to \mathbb{R}$ where $V$ is a normed
space. The norm of a linear functional is

$$\|\ell\| = \sup \{\, |\ell(v)| \mid \|v\|_V \le 1 \,\}.$$

With this norm, the set of linear functionals on $V$ is another normed space $V'$, which is
called the dual of $V$. If $V$ is a Banach space, so is $V'$. For $V = \mathbb{R}^n$, we can identify
the dual space $(\mathbb{R}^n)'$ with $\mathbb{R}^n$; we can think of vectors $x$ in $\mathbb{R}^n$ as column vectors, and
vectors $y$ in $(\mathbb{R}^n)'$ as row vectors so that $y(x) = y\,x$ using regular matrix operations
to multiply a row vector by a column vector. We can identify $x \in \mathbb{R}^n$ with $x^T \in (\mathbb{R}^n)'$.
For complex vectors, we identify $x \in \mathbb{C}^n$ with $\overline{x}^T \in (\mathbb{C}^n)'$.

Note that if $V$ is a real inner product space, $w \mapsto (v, w)$ is a linear functional on
$V$. If $V$ is a complete inner product space (that is, a Hilbert space), then every linear
functional on $V$ can be represented in this way.

We say that $v_i$ converges weakly to $v$ as $i \to \infty$ (denoted $v_i \rightharpoonup v$ as $i \to \infty$)
if $\ell(v_i) \to \ell(v)$ as $i \to \infty$ for every linear functional $\ell$. We say that a sequence of
linear functionals $\ell_j$ converges weakly* to $\ell$ as $j \to \infty$ (denoted $\ell_j \overset{*}{\rightharpoonup} \ell$ as $j \to \infty$)
if $\ell_j(v) \to \ell(v)$ as $j \to \infty$ for every $v \in V$.

If $A\colon V \to W$ is a continuous linear function between Banach spaces, then the
adjoint to $A$, denoted $A^*\colon W' \to V'$, is defined by $A^*(\ell)(v) = \ell(Av)$ for $v \in V$
where $\ell$ is a linear functional on $W$. If $A$ is represented by an $m \times n$ matrix (for
$A\colon \mathbb{R}^n \to \mathbb{R}^m$) then identifying the dual space $(\mathbb{R}^n)'$ with $\mathbb{R}^n$, we have $A^* = A^T$. For
$A\colon \mathbb{C}^n \to \mathbb{C}^m$ then $A^* = \overline{A}^T$. Note that if $A$ is a bounded linear function, so is $A^*$:

$$\|A^*\|_{W' \to V'} = \sup \{\, \|A^*(\ell)\|_{V'} \mid \|\ell\|_{W'} \le 1 \,\}$$
$$= \sup \{\, |A^*(\ell)(v)| \mid \|\ell\|_{W'} \le 1,\ \|v\|_V \le 1 \,\}$$
$$= \sup \{\, |\ell(Av)| \mid \|\ell\|_{W'} \le 1,\ \|v\|_V \le 1 \,\}$$
$$\le \sup \{\, \|\ell\|_{W'}\, \|A\|_{V \to W}\, \|v\|_V \mid \|\ell\|_{W'} \le 1,\ \|v\|_V \le 1 \,\}$$
$$\le \|A\|_{V \to W}.$$

In fact, $\|A^*\|_{W' \to V'} = \|A\|_{V \to W}$.

There is a linear function $J\colon V \to V''$ ($V''$ is the dual of the dual of $V$) given by
$J(v)(\ell) = \ell(v)$ where $\ell \in V'$. The Banach space $V$ is reflexive if $\operatorname{range}(J) = V''$.
Note that both $\mathbb{R}^n$ and $\mathbb{C}^n$ are reflexive, and we identify $(\mathbb{R}^n)''$ with $\mathbb{R}^n$ by $(x^T)^T = x$
and $(\mathbb{C}^n)''$ with $\mathbb{C}^n$ by $\overline{(\overline{x}^T)}^T = x$.

In reflexive spaces, $V$ and $V''$ can be identified, and weak and weak* convergence
are identical. The $W^{m,p}(D)$ spaces are reflexive for $1 < p < \infty$. The space $C(D)$
of continuous functions $D \to \mathbb{R}$ is not reflexive. In fact, the dual space to $C(D)$ can
be identified with the space of finite signed Borel measures $\mu$ with functionals being
represented via integrals with respect to $\mu$:

(A.1.6) $$\ell(f) = \int_D f(x)\, \mu(dx).$$

The Dirac "δ-function" is actually a measure concentrated at zero with the linear
functional:

$$f \mapsto f(0) = \int f(x)\, \delta(x)\, dx.$$

If $W$ is a subspace of $V$ then $W^\perp$ is the set of all $\ell \in V'$ where $\ell(w) = 0$ for all
$w \in W$. If $V$ is an inner product space then we can identify $V$ and $V'$ through the
inner product, and

$$W^\perp = \{\, v \in V \mid (v, w) = 0 \text{ for all } w \in W \,\}$$

is the orthogonal complement of $W$ in $V$.


A.2 Distributions and Fourier Transforms 


A.2.1 Distributions and Measures 


Distributions extend the idea of Dirac "δ-functions" and measures, which are given
meaning as linear functionals on spaces of continuous functions:

$$\varphi \mapsto \int \varphi(x)\, \mu(dx).$$

The Dirac δ-function is the measure that represents the functional $\varphi \mapsto \varphi(0)$.
Distributions generalize the idea of measure as follows: distributions are elements of the dual spaces of
spaces of "nice" functions. More specifically, they are continuous linear functionals
on the space of functions $\varphi$ with the following properties:

• derivatives $\partial^{|\alpha|}\varphi/\partial x^\alpha$ exist everywhere for every multi-index $\alpha$;
• the set $\{\, x \in \mathbb{R}^d \mid \varphi(x) \ne 0 \,\}$ is a bounded set.

This is the space of test functions $\mathcal{S}(\mathbb{R}^d)$. This is a vector space of functions. We do
not give this space a single norm, but instead an infinite family of norms:

(A.2.1) $$\|\varphi\|_{W^{r,\infty}(\mathbb{R}^d)} = \max_{x \in \mathbb{R}^d}\; \max_{\alpha : |\alpha| \le r} \left| \frac{\partial^{|\alpha|} \varphi}{\partial x^\alpha}(x) \right|.$$

Continuous functionals on $\mathcal{S}(\mathbb{R}^d)$ can depend not only on values of the function but
also on values of derivatives, and integrals of derivatives of whatever order we wish,
as long as no functional uses an unbounded number of derivatives. Distributions are
continuous linear functionals on $\mathcal{S}(\mathbb{R}^d)$, denoted $\mathcal{S}(\mathbb{R}^d)'$. As for $\mathcal{S}(\mathbb{R}^d)$, there is no
norm for $\mathcal{S}(\mathbb{R}^d)'$. A conventional function $f\colon \mathbb{R}^d \to \mathbb{R}$ that is integrable is represented
by the distribution given by

(A.2.2) $$\varphi \in \mathcal{S}(\mathbb{R}^d) \mapsto \int_{\mathbb{R}^d} f(x)\, \varphi(x)\, dx.$$
Rd 


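Derivatives can be defined on distributions, even where the corresponding functions
are not differentiable, via integration by parts: in one dimension,

$$\int_{-\infty}^{+\infty} f'(x)\, \varphi(x)\, dx = f(x)\, \varphi(x) \Big|_{x=-\infty}^{x=+\infty} - \int_{-\infty}^{+\infty} f(x)\, \varphi'(x)\, dx = -\int_{-\infty}^{+\infty} f(x)\, \varphi'(x)\, dx,$$

where the boundary term vanishes because $\varphi$ is zero outside a bounded set.
We use the notation $\ell[\varphi]$ to represent the application of a distribution (as a linear
functional) to a test function $\varphi \in \mathcal{S}(\mathbb{R}^d)$, to distinguish it from ordinary function
evaluation. Thus if $\ell \in \mathcal{S}(\mathbb{R})'$ is a linear functional representing a function $f$, then
the functional of the derivative $f'$ is

$$\ell'[\varphi] = -\ell[\varphi'],$$

which is well defined because for any $\varphi \in \mathcal{S}(\mathbb{R})$, $\varphi' \in \mathcal{S}(\mathbb{R})$ as well. For $\ell \in \mathcal{S}(\mathbb{R}^d)'$
we define the linear functional $\partial \ell/\partial x_j$ as

$$\frac{\partial \ell}{\partial x_j}[\varphi] = -\ell\!\left[ \frac{\partial \varphi}{\partial x_j} \right].$$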


Higher derivatives can be treated in the same way. The main issue with these definitions
and manipulations is interpreting what the resulting derivative is. For example,
while the Heaviside function

$$H(x) = \begin{cases} 1, & \text{if } x > 0, \\ 0, & \text{otherwise}, \end{cases}$$

is a regular function for which the linear functional

$$H[\varphi] = \int_{-\infty}^{+\infty} H(x)\, \varphi(x)\, dx = \int_0^{\infty} \varphi(x)\, dx$$

is well defined for every test function $\varphi \in \mathcal{S}(\mathbb{R})$, the derivative is given by

$$H'[\varphi] = -H[\varphi'] = -\int_0^{\infty} \varphi'(x)\, dx = -(\varphi(\infty) - \varphi(0)) = \varphi(0),$$

which matches the Dirac δ-function: $\delta[\varphi] = \varphi(0)$. That is, $H' = \delta$. Thus $H'$ is a
measure. On the other hand, the derivative of the Dirac δ-function

$$\delta'[\varphi] = -\delta[\varphi'] = -\varphi'(0)$$

is not a measure as it is not a continuous functional on functions that are only
continuous.
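
The identity $H'[\varphi] = \varphi(0)$ can be checked numerically for a specific test function. The Python sketch below assumes the standard bump function $\varphi(x) = \exp(-1/(1 - x^2))$ for $|x| < 1$ (zero elsewhere), which is smooth and vanishes outside a bounded set; the function names and the midpoint quadrature rule are illustrative choices, not part of the text.

```python
import math

def bump(x):
    """Smooth test function supported on [-1, 1]."""
    return math.exp(-1.0 / (1.0 - x * x)) if abs(x) < 1.0 else 0.0

def bump_prime(x):
    """Derivative of bump: phi'(x) = phi(x) * (-2x / (1 - x^2)^2)."""
    if abs(x) >= 1.0:
        return 0.0
    w = 1.0 - x * x
    return bump(x) * (-2.0 * x / (w * w))

def heaviside_derivative(phi_prime, upper=1.0, n=20000):
    """H'[phi] = -H[phi'] = -integral_0^upper phi'(x) dx (midpoint rule).

    `upper` only needs to cover the support of phi on the positive axis.
    """
    h = upper / n
    return -h * sum(phi_prime((k + 0.5) * h) for k in range(n))

value = heaviside_derivative(bump_prime)
print(value, bump(0.0))   # both should be close to exp(-1)
```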


A.2.2 Fourier Transforms 


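The Fourier transform of a function $f\colon \mathbb{R} \to \mathbb{R}$ or $\mathbb{R} \to \mathbb{C}$ is given by

(A.2.3) $$\mathcal{F}f(\xi) = \int_{-\infty}^{+\infty} e^{-i \xi x} f(x)\, dx,$$

or for functions on $\mathbb{R}^d$,

(A.2.4) $$\mathcal{F}f(\xi) = \int_{\mathbb{R}^d} e^{-i \xi^T x} f(x)\, dx.$$

This is defined as long as $\int_{\mathbb{R}^d} |f(x)|\, dx$ is finite, although the definition can be
extended to functions where this is not true. We can do this in the spirit of distributions,
but we need to modify the kind of test functions we work with: a tempered test function
is a function $\varphi$ where

• derivatives $\partial^{|\alpha|}\varphi/\partial x^\alpha$ exist everywhere for every multi-index $\alpha$;
• $x^\beta\, \partial^{|\alpha|}\varphi/\partial x^\alpha(x) \to 0$ as $\|x\| \to \infty$ for all multi-indexes $\alpha$ and $\beta$.

Such functions constitute the space of tempered test functions denoted $\mathcal{T}(\mathbb{R}^d)$. It
can be shown fairly easily that $\mathcal{F}\colon \mathcal{T}(\mathbb{R}^d) \to \mathcal{T}(\mathbb{R}^d)$. Furthermore, we can prove a
number of properties of the Fourier transform that hold for every $\varphi \in \mathcal{T}(\mathbb{R}^d)$, and
many other functions besides:

$$\mathcal{F}\!\left[ \frac{\partial \varphi}{\partial x_j} \right](\xi) = i\, \xi_j\, \mathcal{F}\varphi(\xi),$$
$$\mathcal{F}\bigl[ x \mapsto x_j\, \varphi(x) \bigr](\xi) = i\, \frac{\partial}{\partial \xi_j} \mathcal{F}\varphi(\xi),$$
$$\mathcal{F}\bigl[ \varphi(\cdot - a) \bigr](\xi) = e^{-i \xi^T a}\, \mathcal{F}\varphi(\xi),$$
$$\mathcal{F}\bigl[ x \mapsto \varphi(a x) \bigr](\xi) = a^{-d}\, \mathcal{F}\varphi(\xi/a).$$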

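The most important formula is the Fourier inversion formula: if $g = \mathcal{F}f$, then

(A.2.5) $$f(x) = (2\pi)^{-d} \int_{\mathbb{R}^d} e^{+i x^T \xi}\, g(\xi)\, d\xi.$$

This can be expressed more succinctly as $\mathcal{F}^{-1} = (2\pi)^{-d} \mathcal{F}^*$ where $\mathcal{F}^*$ is the adjoint
of $\mathcal{F}$ with respect to the standard complex inner product $(f, g) = \int_{\mathbb{R}^d} \overline{f(x)}\, g(x)\, dx$.
This means that

(A.2.6) $$(2\pi)^{-d} (\mathcal{F}f, \mathcal{F}g) = (2\pi)^{-d} (f, \mathcal{F}^* \mathcal{F} g) = (f, \mathcal{F}^{-1} \mathcal{F} g) = (f, g).$$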

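The way of applying the distribution trick to Fourier transforms starts by noting
that for an integrable, bounded function $f$ and a tempered test function $\varphi$,

$$f[\mathcal{F}\varphi] = \int_{\mathbb{R}^d} f(\xi)\, \mathcal{F}\varphi(\xi)\, d\xi = \int_{\mathbb{R}^d} f(\xi) \int_{\mathbb{R}^d} e^{-i \xi^T x} \varphi(x)\, dx\, d\xi$$
$$= \int_{\mathbb{R}^d} \varphi(x) \int_{\mathbb{R}^d} e^{-i \xi^T x} f(\xi)\, d\xi\, dx \quad \text{(changing order of integration)}$$
$$= \int_{\mathbb{R}^d} \varphi(x)\, \mathcal{F}f(x)\, dx = \mathcal{F}f[\varphi].$$

We extend this definition from functions like $f$ to all linear functionals $\ell$ on $\mathcal{T}(\mathbb{R}^d)$:

$$\mathcal{F}\ell[\varphi] = \ell[\mathcal{F}\varphi].$$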
This means we can compute Fourier transforms of δ-functions: $\mathcal{F}\delta(\xi) = 1$, and
other exotic things such as the comb function $\text{Ш}(x) = \sum_{k \in \mathbb{Z}} \delta(x - k)$: in one dimension, $\mathcal{F}\text{Ш}(\xi) =
\text{Ш}(\xi/(2\pi)) = 2\pi \sum_{k \in \mathbb{Z}} \delta(\xi - 2\pi k)$. Applied to test functions $\varphi$ we get

$$2\pi \sum_{k \in \mathbb{Z}} \varphi(2\pi k) = \mathcal{F}\text{Ш}[\varphi] = \text{Ш}[\mathcal{F}\varphi] = \sum_{k \in \mathbb{Z}} \mathcal{F}\varphi(k),$$

which is essentially the Poisson summation formula.
Some other properties of Fourier transforms relate to convolutions:

(A.2.7) $$f * g(x) = \int_{\mathbb{R}^d} f(y)\, g(x - y)\, dy.$$

The most important of these is the fact that the Fourier transform of a convolution of
two functions is the ordinary product of the Fourier transforms:

(A.2.8) $$\mathcal{F}[f * g](\xi) = \mathcal{F}f(\xi) \cdot \mathcal{F}g(\xi).$$

A similar property holds for the Fourier transform of an ordinary product of two
functions:

(A.2.9) $$\mathcal{F}[f \cdot g] = (2\pi)^{-d}\, \mathcal{F}f * \mathcal{F}g.$$


A.3 Sobolev Spaces 


Sobolev spaces are families of Banach spaces of functions $\Omega \to \mathbb{R}$ where $\Omega$ is a
suitable set in $\mathbb{R}^d$. This section is a brief summary. More details can be found in [11,
31, 213, 245] and other sources. These spaces include all smooth functions $\Omega \to \mathbb{R}$.
The Sobolev $W^{m,p}(\Omega)$ norms, with $\Omega$ an open subset of $\mathbb{R}^d$, are defined first for $m$
a non-negative integer and $1 \le p \le \infty$:

(A.3.1) $$\|f\|_{W^{m,p}(\Omega)} = \left[ \int_\Omega \sum_{j=0}^{m} \|D^j f(x)\|^p\, dx \right]^{1/p} \quad \text{for } 1 \le p < \infty, \text{ and}$$

(A.3.2) $$\|f\|_{W^{m,\infty}(\Omega)} = \sup_{x \in \Omega} \sum_{j=0}^{m} \|D^j f(x)\|.$$

Once the norm is defined for all smooth functions $\Omega \to \mathbb{R}$, the space $W^{m,p}(\Omega)$
is defined as the completion of the smooth functions $\Omega \to \mathbb{R}$ with respect to the
$W^{m,p}(\Omega)$ norm. The space $H^m(\Omega)$ is defined to be $W^{m,2}(\Omega)$, which is a Hilbert
space. This completion can be understood in an abstract sense in terms of equivalence
classes of Cauchy sequences, or as functions in $L^p(\Omega)$ with distributional derivatives
of order up to $m$ in $L^p(\Omega)$. Functions in $W^{m,p}(\Omega)$ can be extended across $\mathbb{R}^d$ by
a linear extension operator $E\colon W^{m,p}(\Omega) \to W^{m,p}(\mathbb{R}^d)$ provided $\Omega$ has a boundary
that can be locally represented as the graph of a Lipschitz function
[237] (see Figure 4.3.9 for an illustration). With the extension to $\mathbb{R}^d$, it is possible
to replace the $W^{m,p}(\mathbb{R}^d)$ norm with an equivalent weighted integral of the Fourier
transform if $p = 2$:

$$\|f\|_{H^m(\mathbb{R}^d)} = (2\pi)^{-d/2} \left[ \int_{\mathbb{R}^d} \sum_{j=0}^{m} \|\xi\|^{2j}\, |\mathcal{F}f(\xi)|^2\, d\xi \right]^{1/2}.$$


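Fractional order Sobolev spaces can also be defined through Sobolev-Slobodetskii
norms

$$\|f\|_{W^{s,p}(\Omega)} = \left[ \|f\|_{W^{m,p}(\Omega)}^p + \sum_{\alpha : |\alpha| = m} \int_\Omega \int_\Omega \frac{\left| \partial^{\alpha} f(x) - \partial^{\alpha} f(y) \right|^p}{\|x - y\|^{p\sigma + d}}\, dx\, dy \right]^{1/p}$$

where $m = \lfloor s \rfloor$, the floor of $s$, and $\sigma = s - \lfloor s \rfloor$. If $p = 2$ and $\Omega = \mathbb{R}^d$, equivalent
norms can be given in terms of Fourier transforms:

$$\|f\|_{H^s(\mathbb{R}^d)} = (2\pi)^{-d/2} \left[ \int_{\mathbb{R}^d} (1 + \|\xi\|^2)^s\, |\mathcal{F}f(\xi)|^2\, d\xi \right]^{1/2}.$$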

As might be expected, increasing order of derivatives involved or level of smoothness
means stronger norms and smaller spaces: if $r < s$ then $W^{s,p}(\Omega) \subset W^{r,p}(\Omega)$.
Furthermore, this embedding is compact: bounded sequences in $W^{s,p}(\Omega)$ get mapped
to sequences in $W^{r,p}(\Omega)$ that have convergent subsequences. This can be important
for showing the existence of solutions.

In partial differential equations, fractional order Sobolev spaces are important
for dealing with boundary values. More specifically, if the region $\Omega$ has a smooth
boundary, then the restriction of a function in $W^{s,p}(\Omega)$ to the boundary $\partial\Omega$ is in
$W^{s-1/p,p}(\partial\Omega)$. In general, for a $k$-dimensional submanifold $M$ in $\Omega$, the restriction
of a function $f \in W^{s,p}(\Omega)$ to $M$ is in $W^{s-(d-k)/p,p}(M)$. If $s - (d-k)/p < 0$, this restriction
is typically not even defined. In the extreme case, if $M$ is a single point in $\Omega$, then the
restriction of $f \in W^{s,p}(\Omega)$ to $M$ is only defined if $s > d/p$. In the case where $p = \infty$,
$W^{m,\infty}(\Omega)$ has restrictions to $W^{m,\infty}(M)$ for any $k$-dimensional submanifold $M$.

The restriction operator $\gamma\colon W^{s,p}(\Omega) \to W^{s-1/p,p}(\partial\Omega)$ is called the trace operator
on $W^{s,p}(\Omega)$.

Even more important for solving partial differential equations is that every function
$g \in W^{s-1/p,p}(\partial\Omega)$ can be extended to a function $\widetilde{g} \in W^{s,p}(\Omega)$ so that $\gamma \widetilde{g} = g$.